diff --git a/README.md b/README.md index 0eaded6..dc356f1 100644 --- a/README.md +++ b/README.md @@ -1,106 +1,205 @@ -

+

-**ArXiv Digest and Personalized Recommendations using Large Language Models.** +# ArXiv Digest (Enhanced Edition) -This repo aims to provide a better daily digest for newly published arXiv papers based on your own research interests and natural-language descriptions, using relevancy ratings from GPT. +**Personalized arXiv Paper Recommendations with Multiple AI Models** -You can try it out on [Hugging Face](https://huggingface.co/spaces/AutoLLM/ArxivDigest) using your own OpenAI API key. - -You can also create a daily subscription pipeline to email you the results. +This repository provides an enhanced daily digest for newly published arXiv papers based on your research interests, leveraging multiple AI models including OpenAI GPT, Google Gemini, and Anthropic Claude to provide relevancy ratings, detailed analysis, and topic clustering. ## 📚 Contents -- [What this repo does](#🔍-what-this-repo-does) - * [Examples](#some-examples) -- [Usage](#💡-usage) - * [Running as a github action using SendGrid (Recommended)](#running-as-a-github-action-using-sendgrid-recommended) - * [Running as a github action with SMTP credentials](#running-as-a-github-action-with-smtp-credentials) - * [Running as a github action without emails](#running-as-a-github-action-without-emails) - * [Running from the command line](#running-from-the-command-line) - * [Running with a user interface](#running-with-a-user-interface) -- [Roadmap](#✅-roadmap) -- [Extending and Contributing](#💝-extending-and-contributing) +- [Features](#-features) +- [Quick Start](#-quick-start) +- [What This Repo Does](#-what-this-repo-does) +- [Model Integrations](#-model-integrations) +- [Design Paper Discovery](#-design-paper-discovery) +- [Output Formats](#-output-formats) +- [Setting Up and Usage](#-setting-up-and-usage) + * [Configuration](#configuration) + * [Running the Web Interface](#running-the-web-interface) + * [Running via GitHub Action](#running-via-github-action) + * [Running from Command 
Line](#running-from-command-line) +- [API Usage Notes](#-api-usage-notes) +- [Directory Structure](#-directory-structure) +- [Roadmap](#-roadmap) +- [Contributing](#-contributing) -## πŸ” What this repo does +## ✨ Features -Staying up to date on [arXiv](https://arxiv.org) papers can take a considerable amount of time, with on the order of hundreds of new papers each day to filter through. There is an [official daily digest service](https://info.arxiv.org/help/subscribe.html), however large categories like [cs.AI](https://arxiv.org/list/cs.AI/recent) still have 50-100 papers a day. Determining if these papers are relevant and important to you means reading through the title and abstract, which is time-consuming. +- **Multi-Model Integration**: Support for OpenAI, Gemini, and Claude models for paper analysis +- **Latest Models**: Support for GPT-4o, GPT-4o mini, Claude 3.5, and other current models +- **Two-Stage Processing**: Efficient paper analysis with quick filtering followed by detailed analysis +- **Enhanced Analysis**: Detailed paper breakdowns including key innovations, critical analysis, and practical applications +- **HTML Report Generation**: Clean, organized reports saved with date-based filenames +- **Adjustable Relevancy Threshold**: Interactive slider for filtering papers by relevance score +- **Design Automation Backend**: Specialized tools for analyzing design-related papers +- **Topic Clustering**: Group similar papers using AI-powered clustering (Gemini) +- **Robust JSON Parsing**: Reliable extraction of analysis results from LLM responses +- **Standardized Directory Structure**: Organized codebase with `/src`, `/data`, and `/digest` directories +- **Improved Web UI**: Clean Gradio interface with dynamic topic selection and error handling -This repository offers a method to curate a daily digest, sorted by relevance, using large language models. 
These models are conditioned based on your personal research interests, which are described in natural language. +![](./readme_images/UIarxiv.png) -* You modify the configuration file `config.yaml` with an arXiv Subject, some set of Categories, and a natural language statement about the type of papers you are interested in. -* The code pulls all the abstracts for papers in those categories and ranks how relevant they are to your interest on a scale of 1-10 using `gpt-3.5-turbo-16k`. -* The code then emits an HTML digest listing all the relevant papers, and optionally emails it to you using [SendGrid](https://sendgrid.com). You will need to have a SendGrid account with an API key for this functionality to work. +## 🚀 Quick Start -### Testing it out with Hugging Face: +Try it out on [Hugging Face](https://huggingface.co/spaces/linhkid91/ArxivDigest-extra) using your own API keys. -We provide a demo at [https://huggingface.co/spaces/AutoLLM/ArxivDigest](https://huggingface.co/spaces/AutoLLM/ArxivDigest). Simply enter your [OpenAI API key](https://platform.openai.com/account/api-keys) and then fill in the configuration on the right. Note that we do not store your key. +## 🔍 What This Repo Does -![hfexample](./readme_images/hf_example.png) +Staying up to date on [arXiv](https://arxiv.org) papers is time-consuming, with hundreds of new papers published daily. Even with the [official daily digest service](https://info.arxiv.org/help/subscribe.html), categories like [cs.AI](https://arxiv.org/list/cs.AI/recent) still contain 50-100 papers per day. -You can also send yourself an email of the digest by creating a SendGrid account and [API key](https://app.SendGrid.com/settings/api_keys). +This repository creates a personalized daily digest by: -### Some examples of results: +1. **Crawling arXiv** for recent papers in your areas of interest +2. **Analyzing papers** in-depth using AI models (OpenAI, Gemini, or Claude) +3. 
**Two-stage processing** for efficiency: + - Stage 1: Quick relevancy filtering using only title and abstract + - Stage 2: Detailed analysis of papers that meet the relevancy threshold +4. **Scoring relevance** on a scale of 1-10 based on your research interests +5. **Providing detailed analysis** of each paper, including: + - Key innovations + - Critical analysis + - Implementation details + - Practical applications + - Related work +6. **Generating reports** in HTML format with clean organization -#### Digest Configuration: -- Subject/Topic: Computer Science -- Categories: Artificial Intelligence, Computation and Language -- Interest: - - Large language model pretraining and finetunings - - Multimodal machine learning - - Do not care about specific application, for example, information extraction, summarization, etc. - - Not interested in paper focus on specific languages, e.g., Arabic, Chinese, etc. +## 🤖 Model Integrations -#### Result: -

+The system supports three major AI providers: -#### Digest Configuration: -- Subject/Topic: Quantitative Finance -- Interest: "making lots of money" +- **OpenAI GPT** (gpt-3.5-turbo-16k, gpt-4, gpt-4-turbo, gpt-4o, gpt-4o-mini) +- **Google Gemini** (gemini-1.5-flash, gemini-1.5-pro, gemini-2.0-flash) +- **Anthropic Claude** (claude-3-haiku, claude-3-sonnet, claude-3-opus, claude-3.5-sonnet) -#### Result: -

+You can use any combination of these models, allowing you to compare results or choose based on your needs. -## 💡 Usage +## 📊 Output Formats -### Running as a github action using SendGrid (Recommended). +Reports are generated in multiple formats: -The recommended way to get started using this repository is to: +- **HTML Reports**: Clean, organized reports saved to the `/digest` directory with date-based filenames +- **Console Output**: Summary information displayed in the terminal +- **JSON Data**: Raw paper data saved to the `/data` directory -1. Fork the repository -2. Modify `config.yaml` and merge the changes into your main branch. -3. Set the following secrets [(under settings, Secrets and variables, repository secrets)](https://docs.github.com/en/actions/security-guides/encrypted-secrets#creating-encrypted-secrets-for-a-repository). See [Advanced Usage](./advanced_usage.md#create-and-fetch-your-api-keys) for more details on how to create and get OpenAi and SendGrid API keys: - `OPENAI_API_KEY` From [OpenAI](https://platform.openai.com/account/api-keys) - `SENDGRID_API_KEY` From [SendGrid](https://app.SendGrid.com/settings/api_keys) - `FROM_EMAIL` This value must match the email you used to create the SendGrid API Key. - `TO_EMAIL` -4. Manually trigger the action or wait until the scheduled action takes place. +Every HTML report includes: +- Paper title, authors, and link to arXiv +- Relevancy score with explanation +- Abstract and key innovations +- Critical analysis and implementation details +- Experiments, results, and discussion points +- Related work and practical applications -See [Advanced Usage](./advanced_usage.md) for more details, including step-by-step images, further customization, and alternate usage. +Example HTML report: -### Running with a user interface +![](/readme_images/example_report.png) +## 💡 Setting Up and Usage -To locally run the same UI as the Huggign Face space: - -1. 
Install the requirements in `src/requirements.txt` as well as `gradio`. -2. Run `python src/app.py` and go to the local URL. From there you will be able to preview the papers from today, as well as the generated digests. -3. If you want to use a `.env` file for your secrets, you can copy `.env.template` to `.env` and then set the environment variables in `.env`. -- Note: These file may be hidden by default in some operating systems due to the dot prefix. -- The .env file is one of the files in .gitignore, so git does not track it and it will not be uploaded to the repository. -- Do not edit the original `.env.template` with your keys or your email address, since `.template.env` is tracked by git and editing it might cause you to commit your secrets. +### Configuration -> **WARNING:** Do not edit and commit your `.env.template` with your personal keys or email address! Doing so may expose these to the world! +Modify `config.yaml` with your preferences: -## βœ… Roadmap +```yaml +# Main research area +topic: "Computer Science" + +# Specific categories to monitor +categories: ["Artificial Intelligence", "Computation and Language", "Machine Learning", "Information Retrieval"] + +# Minimum relevance score (1-10) +threshold: 2 + +# Your research interests in natural language +interest: | + 1. AI alignment and AI safety + 2. Mechanistic interpretability and explainable AI + 3. Large language model optimization + 4. RAGs, Information retrieval + 5. AI Red teaming, deception and misalignment +``` + +### Running the Web Interface + +To run locally with the simplified UI: + +1. Install requirements: `pip install -r requirements.txt` +2. Run the app: `python src/app_new.py` +3. Open the URL displayed in your terminal +4. Enter your API key(s) and configure your preferences +5. 
Use the relevancy threshold slider to adjust paper filtering (default is 2) + +### Running via GitHub Action + +To set up automated daily digests: -- [x] Support personalized paper recommendation using LLM. -- [x] Send emails for daily digest. -- [ ] Implement a ranking factor to prioritize content from specific authors. -- [ ] Support open-source models, e.g., LLaMA, Vicuna, MPT etc. -- [ ] Fine-tune an open-source model to better support paper ranking and stay updated with the latest research concepts.. +1. Fork this repository +2. Update `config.yaml` with your preferences +3. Set the following secrets in your repository settings: + - `OPENAI_API_KEY` (and/or `GEMINI_API_KEY` or `ANTHROPIC_API_KEY`) +4. The GitHub Action will run on schedule or can be triggered manually +### Running from Command Line -## πŸ’ Extending and Contributing +For advanced users: -You may (and are encourage to) modify the code in this repository to suit your personal needs. If you think your modifications would be in any way useful to others, please submit a pull request. +```bash +# Regular paper digests with simplified UI +python src/app_new.py + +# Design paper finder +./src/design/find_design_papers.sh --days 7 --analyze +``` + +## ⚠️ API Usage Notes + +This tool respects arXiv's robots.txt and implements proper rate limiting. If you encounter 403 Forbidden errors: + +1. Wait a few hours before trying again +2. Consider reducing the number of categories you're fetching +3. 
Increase the delay between requests in the code + +## 📁 Directory Structure + +The repository is organized as follows: + +- `/src` - All Python source code + - `app_new.py` - Simplified interface with improved threshold handling and UI + - `download_new_papers.py` - arXiv crawler + - `relevancy.py` - Paper scoring and analysis with robust JSON parsing + - `model_manager.py` - Multi-model integration + - `gemini_utils.py` - Gemini API integration + - `anthropic_utils.py` - Claude API integration + - `design/` - Design automation tools + - `paths.py` - Standardized path handling +- `/data` - JSON data files (auto-created) +- `/digest` - HTML report files (auto-created) + +## ✅ Roadmap -These types of modifications include things like changes to the prompt, different language models, or additional ways for the digest is delivered to you. +- [x] Support multiple AI models (OpenAI, Gemini, Claude) +- [x] Generate comprehensive HTML reports with date-based filenames +- [x] Specialized analysis for design automation papers +- [x] Topic clustering via Gemini +- [x] Standardized directory structure +- [x] Enhanced HTML reports with detailed analysis sections +- [x] Pre-filtering of arXiv categories for efficiency +- [x] Adjustable relevancy threshold with UI slider +- [x] Robust JSON parsing for reliable LLM response handling +- [x] Simplified UI focused on core functionality +- [x] Dynamic topic selection UI with improved error handling +- [x] Support for newer models (GPT-4o, GPT-4o mini, Claude 3.5) +- [x] Two-stage paper processing for efficiency (quick filtering followed by detailed analysis) +- [x] Removed email functionality in favor of local HTML reports +- [ ] Full PDF content analysis +- [ ] Author-based ranking and filtering +- [ ] Fine-tuned open-source model support: Ollama, LocalAI... + +## 💝 Contributing + +You're encouraged to modify this code for your personal needs. If your modifications would be useful to others, please submit a pull request. 
+ +Valuable contributions include: +- Additional AI model integrations +- New analysis capabilities +- UI improvements +- Prompt engineering enhancements diff --git a/config.yaml b/config.yaml index feabf85..ed63acf 100644 --- a/config.yaml +++ b/config.yaml @@ -3,13 +3,13 @@ topic: "Computer Science" # An empty list here will include all categories in a topic # Use the natural language names of the topics, found here: https://arxiv.org # Including more categories will result in more calls to the large language model -categories: ["Artificial Intelligence", "Computation and Language"] +categories: ["Artificial Intelligence", "Computation and Language", "Machine Learning", "Information Retrieval"] # Relevance score threshold. abstracts that receive a score less than this from the large language model # will have their papers filtered out. # # Must be within 1-10 -threshold: 7 +threshold: 2 # A natural language statement that the large language model will use to judge which papers are relevant # @@ -21,7 +21,10 @@ threshold: 7 # This can be empty, which just return a full list of papers with no judgement or filtering, # in whatever order arXiv responds with. interest: | - 1. Large language model pretraining and finetunings - 2. Multimodal machine learning - 3. Do not care about specific application, for example, information extraction, summarization, etc. - 4. Not interested in paper focus on specific languages, e.g., Arabic, Chinese, etc. + 1. AI alignment and AI safety + 2. Mechanistic interpretability and explainable AI + 3. Large language model under pressure + 4. AI Red teaming, deception and misalignment + 5. RAGs, Information retrieval + 6. Optimization of LLM and GenAI + 7. Do not care about specific application, for example, information extraction, summarization, etc. 
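The `threshold` comment in `config.yaml` above says that abstracts scoring below the threshold are filtered out and that the value must lie within 1-10. A minimal Python sketch of that rule follows. It is illustrative only and is not the repository's code: the function name `filter_by_threshold`, the `relevancy_score` key, and the sample papers are all hypothetical.

```python
from typing import Dict, List


def filter_by_threshold(papers: List[Dict], threshold: int) -> List[Dict]:
    """Keep papers scoring at least `threshold`, highest score first.

    Assumes each paper dict carries a hypothetical integer
    `relevancy_score` key on the 1-10 scale described in config.yaml.
    """
    if not 1 <= threshold <= 10:
        raise ValueError("threshold must be within 1-10")
    # Papers scoring below the threshold are dropped.
    kept = [p for p in papers if p["relevancy_score"] >= threshold]
    # sorted() is stable, so ties keep their original (arXiv) order.
    return sorted(kept, key=lambda p: p["relevancy_score"], reverse=True)


# Made-up sample data, for illustration only.
sample_papers = [
    {"title": "Paper A", "relevancy_score": 1},
    {"title": "Paper B", "relevancy_score": 7},
    {"title": "Paper C", "relevancy_score": 2},
]

digest = filter_by_threshold(sample_papers, threshold=2)
strict_digest = filter_by_threshold(sample_papers, threshold=7)
```

With `threshold: 2`, as set in this change, only papers scored 1 are dropped; the previous value of 7 keeps a much shorter list.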
diff --git a/digest/design_papers.html b/digest/design_papers.html new file mode 100644 index 0000000..90f98f6 --- /dev/null +++ b/digest/design_papers.html @@ -0,0 +1,177 @@ + + + + + + Design Automation Papers + + + +

Design Automation Papers

+
+

Found 18 papers related to graphic design automation with AI/ML

+

Generated on 2025-04-06 13:57:23

+
+

Summary Statistics

Categories:

Techniques:

+
+
Concept Lancet: Image Editing with Compositional Representation Transplant
+
Jinqi Luo, Tianjiao Ding, Kwan Ho Ryan Chan, Hancheng Min, Chris Callison-Burch, René Vidal
+
Category: Layout Generation, UI/UX Design, Image Manipulation | Subject: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
+
Techniques: Diffusion Models
+
Abstract: Diffusion models are widely used for image editing tasks. Existing editing methods often design a representation manipulation procedure by curating an edit direction in the text embedding or score space. However, such a procedure faces a key challenge: overestimating the edit strength harms visual consistency while underestimating it fails the editing task. Notably, each source image may require a different editing strength, and it is costly to search for an appropriate strength via trial-and-error. To address this challenge, we propose Concept Lancet (CoLan), a zero-shot plug-and-play framework for principled representation manipulation in diffusion-based image editing. At inference time, we decompose the source input in the latent (text embedding or diffusion score) space as a sparse linear combination of the representations of the collected visual concepts. This allows us to accurately estimate the presence of concepts in each image, which informs the edit. Based on the editing task (replace/add/remove), we perform a customized concept transplant process to impose the corresponding editing direction. To sufficiently model the concept space, we curate a conceptual representation dataset, CoLan-150K, which contains diverse descriptions and scenarios of visual terms and phrases for the latent dictionary. Experiments on multiple diffusion-based image editing baselines show that methods equipped with CoLan achieve state-of-the-art performance in editing effectiveness and consistency preservation.
+
+ +
+
GPTQv2: Efficient Finetuning-Free Quantization for Asymmetric Calibration
+
Yuhang Li, Ruokai Yin, Donghyun Lee, Shiting Xiao, Priyadarshini Panda
+
Category: Layout Generation | Subject: Machine Learning (cs.LG)
+
Techniques: Transformers, Reinforcement Learning, Computer Vision, Large Language Models
+
Abstract: We introduce GPTQv2, a novel finetuning-free quantization method for compressing large-scale transformer architectures. Unlike the previous GPTQ method, which independently calibrates each layer, we always match the quantized layer's output to the exact output in the full-precision model, resulting in a scheme that we call asymmetric calibration. Such a scheme can effectively reduce the quantization error accumulated in previous layers. We analyze this problem using optimal brain compression to derive a close-formed solution. The new solution explicitly minimizes the quantization error as well as the accumulated asymmetry error. Furthermore, we utilize various techniques to parallelize the solution calculation, including channel parallelization, neuron decomposition, and Cholesky reformulation for matrix fusion. As a result, GPTQv2 is easy to implement, simply using 20 more lines of code than GPTQ but improving its performance under low-bit quantization. Remarkably, on a single GPU, we quantize a 405B language transformer as well as EVA-02 the rank first vision transformer that achieves 90% pretraining Imagenet accuracy. Code is available at this http URL.
+
+ +
+
Compositionality Unlocks Deep Interpretable Models
+
Thomas Dooms, Ward Gauderis, Geraint A. Wiggins, Jose Oramas
+
Category: Layout Generation | Subject: Machine Learning (cs.LG)
+
Techniques:
+
Abstract: We propose $\chi$-net, an intrinsically interpretable architecture combining the compositional multilinear structure of tensor networks with the expressivity and efficiency of deep neural networks. $\chi$-nets retain equal accuracy compared to their baseline counterparts. Our novel, efficient diagonalisation algorithm, ODT, reveals linear low-rank structure in a multilayer SVHN model. We leverage this toward formal weight-based interpretability and model compression.
+
+ +
+
Efficient Model Editing with Task-Localized Sparse Fine-tuning
+
Leonardo Iurada, Marco Ciccone, Tatiana Tommasi
+
Category: Layout Generation, UI/UX Design | Subject: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
+
Techniques: Reinforcement Learning
+
Abstract: Task arithmetic has emerged as a promising approach for editing models by representing task-specific knowledge as composable task vectors. However, existing methods rely on network linearization to derive task vectors, leading to computational bottlenecks during training and inference. Moreover, linearization alone does not ensure weight disentanglement, the key property that enables conflict-free composition of task vectors. To address this, we propose TaLoS which allows to build sparse task vectors with minimal interference without requiring explicit linearization and sharing information across tasks. We find that pre-trained models contain a subset of parameters with consistently low gradient sensitivity across tasks, and that sparsely updating only these parameters allows for promoting weight disentanglement during fine-tuning. Our experiments prove that TaLoS improves training and inference efficiency while outperforming current methods in task addition and negation. By enabling modular parameter editing, our approach fosters practical deployment of adaptable foundation models in real-world applications.
+
+ +
+
Knowledge Graph Completion with Mixed Geometry Tensor Factorization
+
Viacheslav Yusupov, Maxim Rakhuba, Evgeny Frolov
+
Category: Layout Generation | Subject: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (stat.ML)
+
Techniques: Reinforcement Learning
+
Abstract: In this paper, we propose a new geometric approach for knowledge graph completion via low rank tensor approximation. We augment a pretrained and well-established Euclidean model based on a Tucker tensor decomposition with a novel hyperbolic interaction term. This correction enables more nuanced capturing of distributional properties in data better aligned with real-world knowledge graphs. By combining two geometries together, our approach improves expressivity of the resulting model achieving new state-of-the-art link prediction accuracy with a significantly lower number of parameters compared to the previous Euclidean and hyperbolic models.
+
+ +
+
Leveraging LLM For Synchronizing Information Across Multilingual Tables
+
Siddharth Khincha, Tushar Kataria, Ankita Anand, Dan Roth, Vivek Gupta
+
Category: Layout Generation | Subject: Computation and Language (cs.CL)
+
Techniques: Reinforcement Learning, Large Language Models
+
Abstract: The vast amount of online information today poses challenges for non-English speakers, as much of it is concentrated in high-resource languages such as English and French. Wikipedia reflects this imbalance, with content in low-resource languages frequently outdated or incomplete. Recent research has sought to improve cross-language synchronization of Wikipedia tables using rule-based methods. These approaches can be effective, but they struggle with complexity and generalization. This paper explores large language models (LLMs) for multilingual information synchronization, using zero-shot prompting as a scalable solution. We introduce the Information Updation dataset, simulating the real-world process of updating outdated Wikipedia tables, and evaluate LLM performance. Our findings reveal that single-prompt approaches often produce suboptimal results, prompting us to introduce a task decomposition strategy that enhances coherence and accuracy. Our proposed method outperforms existing baselines, particularly in Information Updation (1.79%) and Information Addition (20.58%), highlighting the model strength in dynamically updating and enriching data across architectures
+
+ +
+
Improving User Experience with FAICO: Towards a Framework for AI Communication in Human-AI Co-Creativity
+
Jeba Rezwana, Corey Ford
+
Category: UI/UX Design, Design Tools | Subject: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
+
Techniques:
+
Abstract: How AI communicates with humans is crucial for effective human-AI co-creation. However, many existing co-creative AI tools cannot communicate effectively, limiting their potential as collaborators. This paper introduces our initial design of a Framework for designing AI Communication (FAICO) for co-creative AI based on a systematic review of 107 full-length papers. FAICO presents key aspects of AI communication and their impacts on user experience to guide the design of effective AI communication. We then show actionable ways to translate our framework into two practical tools: design cards for designers and a configuration tool for users. The design cards enable designers to consider AI communication strategies that cater to a diverse range of users in co-creative contexts, while the configuration tool empowers users to customize AI communication based on their needs and creative workflows. This paper contributes new insights within the literature on human-AI co-creativity and Human-Computer Interaction, focusing on designing AI communication to enhance user experience.
+
+ +
+
Charm: The Missing Piece in ViT fine-tuning for Image Aesthetic Assessment
+
Fatemeh Behrad, Tinne Tuytelaars, Johan Wagemans
+
Category: Layout Generation | Subject: Computer Vision and Pattern Recognition (cs.CV)
+
Techniques: Transformers, Reinforcement Learning, Computer Vision
+
Abstract: The capacity of Vision transformers (ViTs) to handle variable-sized inputs is often constrained by computational complexity and batch processing limitations. Consequently, ViTs are typically trained on small, fixed-size images obtained through downscaling or cropping. While reducing computational burden, these methods result in significant information loss, negatively affecting tasks like image aesthetic assessment. We introduce Charm, a novel tokenization approach that preserves Composition, High-resolution, Aspect Ratio, and Multi-scale information simultaneously. Charm prioritizes high-resolution details in specific regions while downscaling others, enabling shorter fixed-size input sequences for ViTs while incorporating essential information. Charm is designed to be compatible with pre-trained ViTs and their learned positional embeddings. By providing multiscale input and introducing variety to input tokens, Charm improves ViT performance and generalizability for image aesthetic assessment. We avoid cropping or changing the aspect ratio to further preserve information. Extensive experiments demonstrate significant performance improvements on various image aesthetic and quality assessment datasets (up to 8.1 %) using a lightweight ViT backbone. Code and pre-trained models are available at this https URL.
+
+ +
+
VISTA: Unsupervised 2D Temporal Dependency Representations for Time Series Anomaly Detection
+
Sinchee Chin, Fan Zhang, Xiaochen Yang, Jing-Hao Xue, Wenming Yang, Peng Jia, Guijin Wang, Luo Yingqun
+
Category: Layout Generation, UI/UX Design, 3D Design | Subject: Machine Learning (cs.LG); Information Theory (cs.IT)
+
Techniques: Reinforcement Learning
+
Abstract: Time Series Anomaly Detection (TSAD) is essential for uncovering rare and potentially harmful events in unlabeled time series data. Existing methods are highly dependent on clean, high-quality inputs, making them susceptible to noise and real-world imperfections. Additionally, intricate temporal relationships in time series data are often inadequately captured in traditional 1D representations, leading to suboptimal modeling of dependencies. We introduce VISTA, a training-free, unsupervised TSAD algorithm designed to overcome these challenges. VISTA features three core modules: 1) Time Series Decomposition using Seasonal and Trend Decomposition via Loess (STL) to decompose noisy time series into trend, seasonal, and residual components; 2) Temporal Self-Attention, which transforms 1D time series into 2D temporal correlation matrices for richer dependency modeling and anomaly detection; and 3) Multivariate Temporal Aggregation, which uses a pretrained feature extractor to integrate cross-variable information into a unified, memory-efficient representation. VISTA's training-free approach enables rapid deployment and easy hyperparameter tuning, making it suitable for industrial applications. It achieves state-of-the-art performance on five multivariate TSAD benchmarks.
+
+ +
+
BOOST: Bootstrapping Strategy-Driven Reasoning Programs for Program-Guided Fact-Checking
+
Qisheng Hu, Quanyu Long, Wenya Wang
+
Category: Layout Generation, UI/UX Design | Subject: Artificial Intelligence (cs.AI)
+
Techniques: Reinforcement Learning
+
Abstract: Program-guided reasoning has shown promise in complex claim fact-checking by decomposing claims into function calls and executing reasoning programs. However, prior work primarily relies on few-shot in-context learning (ICL) with ad-hoc demonstrations, which limit program diversity and require manual design with substantial domain knowledge. Fundamentally, the underlying principles of effective reasoning program generation still remain underexplored, making it challenging to construct effective demonstrations. To address this, we propose BOOST, a bootstrapping-based framework for few-shot reasoning program generation. BOOST explicitly integrates claim decomposition and information-gathering strategies as structural guidance for program generation, iteratively refining bootstrapped demonstrations in a strategy-driven and data-centric manner without human intervention. This enables a seamless transition from zero-shot to few-shot strategic program-guided learning, enhancing interpretability and effectiveness. Experimental results show that BOOST outperforms prior few-shot baselines in both zero-shot and few-shot settings for complex claim verification.
+
+ +
+
ConMo: Controllable Motion Disentanglement and Recomposition for Zero-Shot Motion Transfer
+
Jiayi Gao, Zijin Yin, Changcheng Hua, Yuxin Peng, Kongming Liang, Zhanyu Ma, Jun Guo, Yang Liu
+
Category: Layout Generation, UI/UX Design | Subject: Computer Vision and Pattern Recognition (cs.CV)
+
Techniques: Reinforcement Learning
+
Abstract: The development of Text-to-Video (T2V) generation has made motion transfer possible, enabling the control of video motion based on existing footage. However, current methods have two limitations: 1) struggle to handle multi-subjects videos, failing to transfer specific subject motion; 2) struggle to preserve the diversity and accuracy of motion as transferring to subjects with varying shapes. To overcome these, we introduce ConMo, a zero-shot framework that disentangle and recompose the motions of subjects and camera movements. ConMo isolates individual subject and background motion cues from complex trajectories in source videos using only subject masks, and reassembles them for target video generation. This approach enables more accurate motion control across diverse subjects and improves performance in multi-subject scenarios. Additionally, we propose soft guidance in the recomposition stage which controls the retention of original motion to adjust shape constraints, aiding subject shape adaptation and semantic transformation. Unlike previous methods, ConMo unlocks a wide range of applications, including subject size and position editing, subject removal, semantic modifications, and camera motion simulation. Extensive experiments demonstrate that ConMo significantly outperforms state-of-the-art methods in motion fidelity and semantic consistency. The code is available at this https URL.
+
+ +
+
SkyReels-A2: Compose Anything in Video Diffusion Transformers
+
Zhengcong Fei, Debang Li, Di Qiu, Jiahua Wang, Yikun Dou, Rui Wang, Jingtao Xu, Mingyuan Fan, Guibin Chen, Yang Li, Yahui Zhou
+
Category: Layout Generation | Subject: Computer Vision and Pattern Recognition (cs.CV)
+
Techniques: Diffusion Models, Transformers
+
Abstract: This paper presents SkyReels-A2, a controllable video generation framework capable of assembling arbitrary visual elements (e.g., characters, objects, backgrounds) into synthesized videos based on textual prompts while maintaining strict consistency with reference images for each element. We term this task elements-to-video (E2V), whose primary challenges lie in preserving the fidelity of each reference element, ensuring coherent composition of the scene, and achieving natural outputs. To address these, we first design a comprehensive data pipeline to construct prompt-reference-video triplets for model training. Next, we propose a novel image-text joint embedding model to inject multi-element representations into the generative process, balancing element-specific consistency with global coherence and text alignment. We also optimize the inference pipeline for both speed and output stability. Moreover, we introduce a carefully curated benchmark for systematic evaluation, i.e., A2 Bench. Experiments demonstrate that our framework can generate diverse, high-quality videos with precise element control. SkyReels-A2 is the first open-source commercial-grade model for the generation of E2V, performing favorably against advanced closed-source commercial models. We anticipate SkyReels-A2 will advance creative applications such as drama and virtual e-commerce, pushing the boundaries of controllable video generation.
+
+ +
+
Robust Randomized Low-Rank Approximation with Row-Wise Outlier Detection
+
Aidan Tiruvan
+
Category: Layout Generation | Subject: Machine Learning (cs.LG); Numerical Analysis (math.NA)
+
Techniques: Reinforcement Learning
+
Abstract: Robust low-rank approximation under row-wise adversarial corruption can be achieved with a single pass, randomized procedure that detects and removes outlier rows by thresholding their projected norms. We propose a scalable, non-iterative algorithm that efficiently recovers the underlying low-rank structure in the presence of row-wise adversarial corruption. By first compressing the data with a Johnson Lindenstrauss projection, our approach preserves the geometry of clean rows while dramatically reducing dimensionality. Robust statistical techniques based on the median and median absolute deviation then enable precise identification and removal of outlier rows with abnormally high norms. The subsequent rank-k approximation achieves near-optimal error bounds with a one pass procedure that scales linearly with the number of observations. Empirical results confirm that combining random sketches with robust statistics yields efficient, accurate decompositions even in the presence of large fractions of corrupted rows.
+
+ +
+
MG-Gen: Single Image to Motion Graphics Generation with Layer Decomposition
+
Takahiro Shirakawa, Tomoyuki Suzuki, Daichi Haraguchi
+
Category: Layout Generation, UI/UX Design | Subject: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
+
Techniques:
+
Abstract: General image-to-video generation methods often produce suboptimal animations that do not meet the requirements of animated graphics, as they lack active text motion and exhibit object distortion. Also, code-based animation generation methods typically require layer-structured vector data which are often not readily available for motion graphic generation. To address these challenges, we propose a novel framework named MG-Gen that reconstructs data in vector format from a single raster image to extend the capabilities of code-based methods to enable motion graphics generation from a raster image in the framework of general image-to-video generation. MG-Gen first decomposes the input image into layer-wise elements, reconstructs them as HTML format data and then generates executable JavaScript code for the reconstructed HTML data. We experimentally confirm that MG-Gen generates motion graphics while preserving text readability and input consistency. These successful results indicate that combining layer decomposition and animation code generation is an effective strategy for motion graphics generation.
+
+ +
+
LearNAT: Learning NL2SQL with AST-guided Task Decomposition for Large Language Models
+
Weibin Liao, Xin Gao, Tianyu Jia, Rihong Qiu, Yifan Zhu, Yang Lin, Xu Chu, Junfeng Zhao, Yasha Wang
+
Category: Layout Generation, UI/UX Design | Subject: Computation and Language (cs.CL)
+
Techniques: Reinforcement Learning, Large Language Models
+
Abstract: Natural Language to SQL (NL2SQL) has emerged as a critical task for enabling seamless interaction with databases. Recent advancements in Large Language Models (LLMs) have demonstrated remarkable performance in this domain. However, existing NL2SQL methods predominantly rely on closed-source LLMs leveraging prompt engineering, while open-source models typically require fine-tuning to acquire domain-specific knowledge. Despite these efforts, open-source LLMs struggle with complex NL2SQL tasks due to the indirect expression of user query objectives and the semantic gap between user queries and database schemas. Inspired by the application of reinforcement learning in mathematical problem-solving to encourage step-by-step reasoning in LLMs, we propose LearNAT (Learning NL2SQL with AST-guided Task Decomposition), a novel framework that improves the performance of open-source LLMs on complex NL2SQL tasks through task decomposition and reinforcement learning. LearNAT introduces three key components: (1) a Decomposition Synthesis Procedure that leverages Abstract Syntax Trees (ASTs) to guide efficient search and pruning strategies for task decomposition, (2) Margin-aware Reinforcement Learning, which employs fine-grained step-level optimization via DPO with AST margins, and (3) Adaptive Demonstration Reasoning, a mechanism for dynamically selecting relevant examples to enhance decomposition capabilities. Extensive experiments on two benchmark datasets, Spider and BIRD, demonstrate that LearNAT enables a 7B-parameter open-source LLM to achieve performance comparable to GPT-4, while offering improved efficiency and accessibility.
+
+ +
+
AC-LoRA: Auto Component LoRA for Personalized Artistic Style Image Generation
+
Zhipu Cui, Andong Tian, Zhi Ying, Jialiang Lu
+
Category: Layout Generation, Multimodal Design | Subject: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
+
Techniques:
+
Abstract: Personalized image generation allows users to preserve styles or subjects of a provided small set of images for further image generation. With the advancement in large text-to-image models, many techniques have been developed to efficiently fine-tune those models for personalization, such as Low Rank Adaptation (LoRA). However, LoRA-based methods often face the challenge of adjusting the rank parameter to achieve satisfactory results. To address this challenge, AutoComponent-LoRA (AC-LoRA) is proposed, which is able to automatically separate the signal component and noise component of the LoRA matrices for fast and efficient personalized artistic style image generation. This method is based on Singular Value Decomposition (SVD) and dynamic heuristics to update the hyperparameters during training. Superior performance over existing methods in overcoming model underfitting or overfitting problems is demonstrated. The results were validated using FID, CLIP, DINO, and ImageReward, achieving an average of 9% improvement.
+
+ +
+
On the Geometry of Receiver Operating Characteristic and Precision-Recall Curves
+
Reza Sameni
+
Category: Layout Generation, UI/UX Design, Design Tools | Subject: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST); Machine Learning (stat.ML)
+
Techniques: Reinforcement Learning
+
Abstract: We study the geometry of Receiver Operating Characteristic (ROC) and Precision-Recall (PR) curves in binary classification problems. The key finding is that many of the most commonly used binary classification metrics are merely functions of the composition function $G := F_p \circ F_n^{-1}$, where $F_p(\cdot)$ and $F_n(\cdot)$ are the class-conditional cumulative distribution functions of the classifier scores in the positive and negative classes, respectively. This geometric perspective facilitates the selection of operating points, understanding the effect of decision thresholds, and comparison between classifiers. It also helps explain how the shapes and geometry of ROC/PR curves reflect classifier behavior, providing objective tools for building classifiers optimized for specific applications with context-specific constraints. We further explore the conditions for classifier dominance, present analytical and numerical examples demonstrating the effects of class separability and variance on ROC and PR geometries, and derive a link between the positive-to-negative class leakage function $G(\cdot)$ and the Kullback--Leibler divergence. The framework highlights practical considerations, such as model calibration, cost-sensitive optimization, and operating point selection under real-world capacity constraints, enabling more informed approaches to classifier deployment and decision-making.
+
+ +
+
Fourier Feature Attribution: A New Efficiency Attribution Method
+
Zechen Liu, Feiyang Zhang, Wei Song, Xiang Li, Wei Wei
+
Category: Layout Generation, UI/UX Design | Subject: Machine Learning (cs.LG)
+
Techniques: Transformers, Computer Vision
+
Abstract: The study of neural networks from the perspective of Fourier features has garnered significant attention. While existing analytical research suggests that neural networks tend to learn low-frequency features, a clear attribution method for identifying the specific learned Fourier features has remained elusive. To bridge this gap, we propose a novel Fourier feature attribution method grounded in signal decomposition theory. Additionally, we analyze the differences between game-theoretic attribution metrics for Fourier and spatial domain features, demonstrating that game-theoretic evaluation metrics are better suited for Fourier-based feature attribution. Our experiments show that Fourier feature attribution exhibits superior feature selection capabilities compared to spatial domain attribution methods. For instance, in the case of Vision Transformers (ViTs) on the ImageNet dataset, only $8\%$ of the Fourier features are required to maintain the original predictions for $80\%$ of the samples. Furthermore, we compare the specificity of features identified by our method against traditional spatial domain attribution methods. Results reveal that Fourier features exhibit greater intra-class concentration and inter-class distinctiveness, indicating their potential for more efficient classification and explainable AI algorithms.
+
+ + + + + \ No newline at end of file diff --git a/find_design_papers.sh b/find_design_papers.sh new file mode 100755 index 0000000..6faffb2 --- /dev/null +++ b/find_design_papers.sh @@ -0,0 +1,13 @@ +#!/bin/bash +# Root-level wrapper script for the design papers finder + +# Show deprecation warning +echo "ℹ️ Note: This script is a wrapper for ./src/design/find_design_papers.sh" +echo "ℹ️ Consider using ./src/design/find_design_papers.sh directly for best results" +echo "" + +# Simply forward all arguments to the actual script +./src/design/find_design_papers.sh "$@" + +# The exit code will propagate from the called script +exit $? \ No newline at end of file diff --git a/readme_images/UIarxiv.png b/readme_images/UIarxiv.png new file mode 100644 index 0000000..9a6b793 Binary files /dev/null and b/readme_images/UIarxiv.png differ diff --git a/readme_images/example_custom_1.png b/readme_images/example_custom_1.png new file mode 100644 index 0000000..0265f76 Binary files /dev/null and b/readme_images/example_custom_1.png differ diff --git a/readme_images/example_report.png b/readme_images/example_report.png new file mode 100644 index 0000000..ea2163e Binary files /dev/null and b/readme_images/example_report.png differ diff --git a/readme_images/main_banner.png b/readme_images/main_banner.png new file mode 100644 index 0000000..56fe575 Binary files /dev/null and b/readme_images/main_banner.png differ diff --git a/requirements.txt b/requirements.txt index c7c0b55..524a7e3 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,8 +1,11 @@ PyYAML==6.0 beautifulsoup4==4.12.2 numpy==1.24.2 -openai==0.27.8 +openai>=1.3.0 python-dotenv==1.0.0 pytz==2023.3 sendgrid==6.10.0 tqdm==4.65.0 +google-generativeai>=0.3.0 +anthropic>=0.8.0 +gradio>=3.50.0 \ No newline at end of file diff --git a/run.sh b/run.sh new file mode 100755 index 0000000..68f03fa --- /dev/null +++ b/run.sh @@ -0,0 +1,5 @@ +#!/bin/bash +# Run the ArxivDigest-extra app using the latest version +echo "Starting 
ArxivDigest-extra..." +cd "$(dirname "$0")" +python src/app_new.py \ No newline at end of file diff --git a/src/action.py b/src/action.py index 80f736c..4667981 100644 --- a/src/action.py +++ b/src/action.py @@ -1,8 +1,6 @@ from sendgrid import SendGridAPIClient from sendgrid.helpers.mail import Mail, Email, To, Content -from datetime import date - import argparse import yaml import os @@ -10,7 +8,11 @@ import openai from relevancy import generate_relevance_score, process_subject_fields from download_new_papers import get_papers +from datetime import date + +import ssl +ssl._create_default_https_context = ssl._create_stdlib_context # Hackathon quality code. Don't judge too harshly. # Feel free to submit pull requests to improve the code. @@ -222,6 +224,7 @@ def generate_body(topic, categories, interest, threshold): + f_papers = [] if topic == "Physics": raise RuntimeError("You must choose a physics subtopic.") elif topic in physics_topics: @@ -235,11 +238,13 @@ def generate_body(topic, categories, interest, threshold): if category not in category_map[topic]: raise RuntimeError(f"{category} is not a category of {topic}") papers = get_papers(abbr) + papers = [ t for t in papers if bool(set(process_subject_fields(t["subjects"])) & set(categories)) ] + else: papers = get_papers(abbr) if interest: @@ -247,11 +252,16 @@ def generate_body(topic, categories, interest, threshold): papers, query={"interest": interest}, threshold_score=threshold, - num_paper_in_prompt=16, + num_paper_in_prompt=2, ) + body = "

".join( [ - f'Title: {paper["title"]}
Authors: {paper["authors"]}
Score: {paper["Relevancy score"]}
Reason: {paper["Reasons for match"]}' + f'Subject: {paper["subjects"]}
Title: {paper["title"]}
Authors: {paper["authors"]}
' + f'Score: {paper["Relevancy score"]}
Reason: {paper["Reasons for match"]}
' + f'Goal: {paper["Goal"]}
Data: {paper["Data"]}
Methodology: {paper["Methodology"]}
' + f'Experiments & Results: {paper["Experiments & Results"]}
Git: {paper["Git"]}
' + f'Discussion & Next steps: {paper["Discussion & Next steps"]}' for paper in relevancy ] ) @@ -269,6 +279,10 @@ def generate_body(topic, categories, interest, threshold): ) return body +def get_date(): + today = date.today() + formatted_date = today.strftime("%d%m%Y") + return formatted_date if __name__ == "__main__": # Load the .env file. @@ -292,7 +306,8 @@ def generate_body(topic, categories, interest, threshold): threshold = config["threshold"] interest = config["interest"] body = generate_body(topic, categories, interest, threshold) - with open("digest.html", "w") as f: + today_date = get_date() + with open(f"digest_{today_date}.html", "w") as f: f.write(body) if os.environ.get("SENDGRID_API_KEY", None): sg = SendGridAPIClient(api_key=os.environ.get("SENDGRID_API_KEY")) diff --git a/src/anthropic_utils.py b/src/anthropic_utils.py new file mode 100644 index 0000000..e6ef0e7 --- /dev/null +++ b/src/anthropic_utils.py @@ -0,0 +1,322 @@ +""" +Anthropic/Claude API integration for ArxivDigest. +This module provides functions to work with Anthropic's Claude API for paper analysis. +""" +import os +import json +import logging +import time +from typing import List, Dict, Any, Optional + +try: + import anthropic + from anthropic.types import MessageParam + ANTHROPIC_AVAILABLE = True +except ImportError: + ANTHROPIC_AVAILABLE = False + +# Configure logging +logging.basicConfig(level=logging.INFO) +logger = logging.getLogger(__name__) + +class ClaudeConfig: + """Configuration for Claude API calls.""" + def __init__( + self, + temperature: float = 0.5, + max_tokens: int = 4000, + top_p: float = 0.95, + top_k: int = 40 + ): + self.temperature = temperature + self.max_tokens = max_tokens + self.top_p = top_p + self.top_k = top_k + +def setup_anthropic_api(api_key: str) -> bool: + """ + Setup the Anthropic API with the provided API key. 
+ + Args: + api_key: Anthropic API key + + Returns: + bool: True if setup was successful, False otherwise + """ + if not ANTHROPIC_AVAILABLE: + logger.error("Anthropic package not installed. Run 'pip install anthropic'") + return False + + if not api_key: + logger.error("No Anthropic API key provided") + return False + + try: + # Initialize client to test connection + client = anthropic.Anthropic(api_key=api_key) + # Test API connection by listing models + models = client.models.list() + available_models = [model.id for model in models.data] + logger.info(f"Successfully connected to Anthropic API. Available models: {available_models}") + return True + except Exception as e: + logger.error(f"Failed to setup Anthropic API: {e}") + return False + +def get_claude_client(api_key: str) -> Optional["anthropic.Anthropic"]: + """ + Get an Anthropic client with the given API key. + + Args: + api_key: Anthropic API key + + Returns: + Anthropic client or None if not available + """ + if not ANTHROPIC_AVAILABLE: + return None + + try: + client = anthropic.Anthropic(api_key=api_key) + return client + except Exception as e: + logger.error(f"Failed to get Anthropic client: {e}") + return None + +def analyze_papers_with_claude( + papers: List[Dict[str, Any]], + query: Dict[str, str], + config: Optional[ClaudeConfig] = None, + model_name: str = "claude-3-5-sonnet-20240620", + api_key: str = None +) -> List[Dict[str, Any]]: + """ + Analyze papers using Claude. + + Args: + papers: List of paper dictionaries + query: Dictionary with 'interest' key describing research interests + config: ClaudeConfig object + model_name: Name of the Claude model to use + api_key: Anthropic API key (optional if already configured elsewhere) + + Returns: + List of papers with added analysis + """ + if not ANTHROPIC_AVAILABLE: + logger.error("Anthropic package not installed. 
Cannot analyze papers.") + return papers + + if not config: + config = ClaudeConfig() + + # Get client + if api_key: + client = get_claude_client(api_key) + else: + # Try to get from environment + api_key = os.environ.get("ANTHROPIC_API_KEY", "") + if not api_key: + logger.error("No Anthropic API key provided") + return papers + client = get_claude_client(api_key) + + if not client: + return papers + + analyzed_papers = [] + + for paper in papers: + try: + # Prepare system prompt + system_prompt = f""" + You are a research assistant analyzing academic papers in AI and ML. + You provide comprehensive, accurate and unbiased analysis based on the user's research interests. + Your responses should be well-structured and factual, focusing on the paper's strengths, weaknesses, and relevance. + """ + + # Prepare user prompt + user_prompt = f""" + Analyze this paper and provide insights based on the following research interests: + + Research interests: {query['interest']} + + Paper details: + Title: {paper['title']} + Authors: {paper['authors']} + Abstract: {paper['abstract']} + Content: {paper['content'][:5000] if 'content' in paper else 'Not available'} + + Please provide your response as a single JSON object with the following structure: + {{ + "Relevancy score": 1-10 (higher = more relevant), + "Reasons for match": "Detailed explanation of why this paper matches the interests", + "Key innovations": "List the main contributions of the paper", + "Critical analysis": "Evaluate strengths and weaknesses", + "Goal": "What problem does the paper address?", + "Data": "Description of datasets used", + "Methodology": "Technical approach and methods", + "Implementation details": "Model architecture, hyperparameters, etc.", + "Experiments & Results": "Key findings and comparisons", + "Discussion & Next steps": "Limitations and future work", + "Related work": "Connection to similar research", + "Practical applications": "Real-world uses of this research", + "Key takeaways": ["Point 
1", "Point 2", "Point 3"] + }} + + Format your response as a valid JSON object and nothing else. + """ + + # Just log that we're sending a prompt to Claude + print(f"Sending prompt to Claude for paper: {paper['title'][:50]}...") + + # Create message + messages: List[MessageParam] = [ + { + "role": "user", + "content": user_prompt + } + ] + + # Call the API + response = client.messages.create( + model=model_name, + max_tokens=config.max_tokens, + temperature=config.temperature, + system=system_prompt, + messages=messages + ) + + # Extract and parse the response + response_text = response.content[0].text if response.content else "" + + # Try to extract JSON + try: + start_idx = response_text.find('{') + end_idx = response_text.rfind('}') + 1 + if start_idx >= 0 and end_idx > start_idx: + json_str = response_text[start_idx:end_idx] + claude_analysis = json.loads(json_str) + + # Add Claude analysis to paper + paper['claude_analysis'] = claude_analysis + + # Directly copy fields to paper + for key, value in claude_analysis.items(): + paper[key] = value + else: + logger.warning(f"Could not extract JSON from Claude response for paper {paper['title']}") + paper['claude_analysis'] = {"error": "Failed to parse response"} + except json.JSONDecodeError: + logger.warning(f"Failed to parse Claude response as JSON for paper {paper['title']}") + paper['claude_analysis'] = {"error": "Failed to parse response"} + + analyzed_papers.append(paper) + + # Avoid rate limiting + time.sleep(1) + + except Exception as e: + logger.error(f"Claude API error: {e}") + paper['claude_analysis'] = {"error": f"Claude API error: {str(e)}"} + analyzed_papers.append(paper) + + return analyzed_papers + +def get_claude_interpretability_analysis(paper: Dict[str, Any], model_name: str = "claude-3-5-sonnet-20240620", api_key: str = None) -> Dict[str, Any]: + """ + Get specialized mechanistic interpretability analysis for a paper using Claude. 
+ + Args: + paper: Paper dictionary + model_name: Claude model to use + api_key: Anthropic API key (optional if already configured elsewhere) + + Returns: + Dictionary with interpretability analysis + """ + if not ANTHROPIC_AVAILABLE: + return {"error": "Anthropic package not installed"} + + # Get client + if api_key: + client = get_claude_client(api_key) + else: + # Try to get from environment + api_key = os.environ.get("ANTHROPIC_API_KEY", "") + if not api_key: + return {"error": "No Anthropic API key provided"} + client = get_claude_client(api_key) + + if not client: + return {"error": "Failed to initialize Anthropic client"} + + try: + # Prepare system prompt + system_prompt = """ + You are a specialist in mechanistic interpretability and AI alignment. + Provide a thorough analysis of research papers with focus on interpretability methods, + circuit analysis, and how the work relates to understanding AI systems. + """ + + # Prepare the prompt + user_prompt = f""" + Analyze this paper from a mechanistic interpretability perspective: + + Title: {paper['title']} + Authors: {paper['authors']} + Abstract: {paper['abstract']} + Content: {paper['content'][:7000] if 'content' in paper else paper['abstract']} + + Please return your analysis as a JSON object with the following fields: + + {{ + "interpretability_score": 1-10 (how relevant is this to mechanistic interpretability), + "key_methods": "Main interpretability techniques used or proposed", + "circuit_analysis": "Any findings about neural circuits or components", + "relevance_to_alignment": "How this work contributes to AI alignment", + "novel_insights": "New perspectives on model internals", + "limitations": "Limitations of the interpretability methods", + "potential_extensions": "How this work could be extended", + "connection_to_other_work": "Relationship to other interpretability papers" + }} + + Respond with only the JSON. 
+ """ + + # Create message + messages: List[MessageParam] = [ + { + "role": "user", + "content": user_prompt + } + ] + + # Call the API + response = client.messages.create( + model=model_name, + max_tokens=4000, + temperature=0.3, + system=system_prompt, + messages=messages + ) + + # Extract and parse the response + response_text = response.content[0].text if response.content else "" + + # Try to extract JSON + try: + # Find the JSON part in the response + start_idx = response_text.find('{') + end_idx = response_text.rfind('}') + 1 + if start_idx >= 0 and end_idx > start_idx: + json_str = response_text[start_idx:end_idx] + analysis = json.loads(json_str) + return analysis + else: + return {"error": "Could not extract JSON from response"} + except json.JSONDecodeError: + return {"error": "Failed to parse response as JSON"} + + except Exception as e: + return {"error": f"Claude API error: {str(e)}"} \ No newline at end of file diff --git a/src/app.py b/src/app.py deleted file mode 100644 index 0743ad5..0000000 --- a/src/app.py +++ /dev/null @@ -1,195 +0,0 @@ -import gradio as gr -from download_new_papers import get_papers -import utils -from relevancy import generate_relevance_score, process_subject_fields -from sendgrid.helpers.mail import Mail, Email, To, Content -import sendgrid -import os -import openai - -topics = { - "Physics": "", - "Mathematics": "math", - "Computer Science": "cs", - "Quantitative Biology": "q-bio", - "Quantitative Finance": "q-fin", - "Statistics": "stat", - "Electrical Engineering and Systems Science": "eess", - "Economics": "econ" -} - -physics_topics = { - "Astrophysics": "astro-ph", - "Condensed Matter": "cond-mat", - "General Relativity and Quantum Cosmology": "gr-qc", - "High Energy Physics - Experiment": "hep-ex", - "High Energy Physics - Lattice": "hep-lat", - "High Energy Physics - Phenomenology": "hep-ph", - "High Energy Physics - Theory": "hep-th", - "Mathematical Physics": "math-ph", - "Nonlinear Sciences": "nlin", - "Nuclear 
Experiment": "nucl-ex", - "Nuclear Theory": "nucl-th", - "Physics": "physics", - "Quantum Physics": "quant-ph" -} - -categories_map = { - "Astrophysics": ["Astrophysics of Galaxies", "Cosmology and Nongalactic Astrophysics", "Earth and Planetary Astrophysics", "High Energy Astrophysical Phenomena", "Instrumentation and Methods for Astrophysics", "Solar and Stellar Astrophysics"], - "Condensed Matter": ["Disordered Systems and Neural Networks", "Materials Science", "Mesoscale and Nanoscale Physics", "Other Condensed Matter", "Quantum Gases", "Soft Condensed Matter", "Statistical Mechanics", "Strongly Correlated Electrons", "Superconductivity"], - "General Relativity and Quantum Cosmology": ["None"], - "High Energy Physics - Experiment": ["None"], - "High Energy Physics - Lattice": ["None"], - "High Energy Physics - Phenomenology": ["None"], - "High Energy Physics - Theory": ["None"], - "Mathematical Physics": ["None"], - "Nonlinear Sciences": ["Adaptation and Self-Organizing Systems", "Cellular Automata and Lattice Gases", "Chaotic Dynamics", "Exactly Solvable and Integrable Systems", "Pattern Formation and Solitons"], - "Nuclear Experiment": ["None"], - "Nuclear Theory": ["None"], - "Physics": ["Accelerator Physics", "Applied Physics", "Atmospheric and Oceanic Physics", "Atomic and Molecular Clusters", "Atomic Physics", "Biological Physics", "Chemical Physics", "Classical Physics", "Computational Physics", "Data Analysis, Statistics and Probability", "Fluid Dynamics", "General Physics", "Geophysics", "History and Philosophy of Physics", "Instrumentation and Detectors", "Medical Physics", "Optics", "Physics and Society", "Physics Education", "Plasma Physics", "Popular Physics", "Space Physics"], - "Quantum Physics": ["None"], - "Mathematics": ["Algebraic Geometry", "Algebraic Topology", "Analysis of PDEs", "Category Theory", "Classical Analysis and ODEs", "Combinatorics", "Commutative Algebra", "Complex Variables", "Differential Geometry", "Dynamical Systems", 
"Functional Analysis", "General Mathematics", "General Topology", "Geometric Topology", "Group Theory", "History and Overview", "Information Theory", "K-Theory and Homology", "Logic", "Mathematical Physics", "Metric Geometry", "Number Theory", "Numerical Analysis", "Operator Algebras", "Optimization and Control", "Probability", "Quantum Algebra", "Representation Theory", "Rings and Algebras", "Spectral Theory", "Statistics Theory", "Symplectic Geometry"], - "Computer Science": ["Artificial Intelligence", "Computation and Language", "Computational Complexity", "Computational Engineering, Finance, and Science", "Computational Geometry", "Computer Science and Game Theory", "Computer Vision and Pattern Recognition", "Computers and Society", "Cryptography and Security", "Data Structures and Algorithms", "Databases", "Digital Libraries", "Discrete Mathematics", "Distributed, Parallel, and Cluster Computing", "Emerging Technologies", "Formal Languages and Automata Theory", "General Literature", "Graphics", "Hardware Architecture", "Human-Computer Interaction", "Information Retrieval", "Information Theory", "Logic in Computer Science", "Machine Learning", "Mathematical Software", "Multiagent Systems", "Multimedia", "Networking and Internet Architecture", "Neural and Evolutionary Computing", "Numerical Analysis", "Operating Systems", "Other Computer Science", "Performance", "Programming Languages", "Robotics", "Social and Information Networks", "Software Engineering", "Sound", "Symbolic Computation", "Systems and Control"], - "Quantitative Biology": ["Biomolecules", "Cell Behavior", "Genomics", "Molecular Networks", "Neurons and Cognition", "Other Quantitative Biology", "Populations and Evolution", "Quantitative Methods", "Subcellular Processes", "Tissues and Organs"], - "Quantitative Finance": ["Computational Finance", "Economics", "General Finance", "Mathematical Finance", "Portfolio Management", "Pricing of Securities", "Risk Management", "Statistical Finance", "Trading 
and Market Microstructure"], - "Statistics": ["Applications", "Computation", "Machine Learning", "Methodology", "Other Statistics", "Statistics Theory"], - "Electrical Engineering and Systems Science": ["Audio and Speech Processing", "Image and Video Processing", "Signal Processing", "Systems and Control"], - "Economics": ["Econometrics", "General Economics", "Theoretical Economics"] -} - - -def sample(email, topic, physics_topic, categories, interest): - if not topic: - raise gr.Error("You must choose a topic.") - if topic == "Physics": - if isinstance(physics_topic, list): - raise gr.Error("You must choose a physics topic.") - topic = physics_topic - abbr = physics_topics[topic] - else: - abbr = topics[topic] - if categories: - papers = get_papers(abbr) - papers = [ - t for t in papers - if bool(set(process_subject_fields(t['subjects'])) & set(categories))][:4] - else: - papers = get_papers(abbr, limit=4) - if interest: - if not openai.api_key: raise gr.Error("Set your OpenAI api key on the left first") - relevancy, _ = generate_relevance_score( - papers, - query={"interest": interest}, - threshold_score=0, - num_paper_in_prompt=4) - return "\n\n".join([paper["summarized_text"] for paper in relevancy]) - else: - return "\n\n".join(f"Title: {paper['title']}\nAuthors: {paper['authors']}" for paper in papers) - - -def change_subsubject(subject, physics_subject): - if subject != "Physics": - return gr.Dropdown.update(choices=categories_map[subject], value=[], visible=True) - else: - if physics_subject and not isinstance(physics_subject, list): - return gr.Dropdown.update(choices=categories_map[physics_subject], value=[], visible=True) - else: - return gr.Dropdown.update(choices=[], value=[], visible=False) - - -def change_physics(subject): - if subject != "Physics": - return gr.Dropdown.update(visible=False, value=[]) - else: - return gr.Dropdown.update(physics_topics, visible=True) - - -def test(email, topic, physics_topic, categories, interest, key): - if not 
email: raise gr.Error("Set your email") - if not key: raise gr.Error("Set your SendGrid key") - if topic == "Physics": - if isinstance(physics_topic, list): - raise gr.Error("You must choose a physics topic.") - topic = physics_topic - abbr = physics_topics[topic] - else: - abbr = topics[topic] - if categories: - papers = get_papers(abbr) - papers = [ - t for t in papers - if bool(set(process_subject_fields(t['subjects'])) & set(categories))][:4] - else: - papers = get_papers(abbr, limit=4) - if interest: - if not openai.api_key: raise gr.Error("Set your OpenAI api key on the left first") - relevancy, hallucination = generate_relevance_score( - papers, - query={"interest": interest}, - threshold_score=7, - num_paper_in_prompt=8) - body = "

".join([f'Title: {paper["title"]}
Authors: {paper["authors"]}
Score: {paper["Relevancy score"]}
Reason: {paper["Reasons for match"]}' for paper in relevancy]) - if hallucination: - body = "Warning: the model hallucinated some papers. We have tried to remove them, but the scores may not be accurate.

" + body - else: - body = "

".join([f'Title: {paper["title"]}
Authors: {paper["authors"]}' for paper in papers]) - sg = sendgrid.SendGridAPIClient(api_key=key) - from_email = Email(email) - to_email = To(email) - subject = "arXiv digest" - content = Content("text/html", body) - mail = Mail(from_email, to_email, subject, content) - mail_json = mail.get() - - # Send an HTTP POST request to /mail/send - response = sg.client.mail.send.post(request_body=mail_json) - if response.status_code >= 200 and response.status_code <= 300: - return "Success!" - else: - return "Failure: ({response.status_code})" - - -def register_openai_token(token): - openai.api_key = token - -with gr.Blocks() as demo: - with gr.Row(): - with gr.Column(scale=1): - token = gr.Textbox(label="OpenAI API Key", type="password") - subject = gr.Radio( - list(topics.keys()), label="Topic" - ) - physics_subject = gr.Dropdown(physics_topics, value=[], multiselect=False, label="Physics category", visible=False, info="") - subsubject = gr.Dropdown( - [], value=[], multiselect=True, label="Subtopic", info="Optional. Leaving it empty will use all subtopics.", visible=False) - subject.change(fn=change_physics, inputs=[subject], outputs=physics_subject) - subject.change(fn=change_subsubject, inputs=[subject, physics_subject], outputs=subsubject) - physics_subject.change(fn=change_subsubject, inputs=[subject, physics_subject], outputs=subsubject) - - interest = gr.Textbox(label="A natural language description of what you are interested in. We will generate relevancy scores (1-10) and explanations for the papers in the selected topics according to this statement.", info="Press shift-enter or click the button below to update.", lines=7) - sample_btn = gr.Button("Generate Digest") - sample_output = gr.Textbox(label="Results for your configuration.", info="For runtime purposes, this is only done on a small subset of recent papers in the topic you have selected. 
Papers will not be filtered by relevancy, only sorted on a scale of 1-10.") - with gr.Column(scale=0.40): - with gr.Box(): - title = gr.Markdown( - """ - # Email Setup, Optional - Send an email to the below address using the configuration on the right. Requires a sendgrid token. These values are not needed to use the right side of this page. - - To create a scheduled job for this, see our [Github Repository](https://github.com/AutoLLM/ArxivDigest) - """, - interactive=False, show_label=False) - email = gr.Textbox(label="Email address", type="email", placeholder="") - sendgrid_token = gr.Textbox(label="SendGrid API Key", type="password") - with gr.Row(): - test_btn = gr.Button("Send email") - output = gr.Textbox(show_label=False, placeholder="email status") - test_btn.click(fn=test, inputs=[email, subject, physics_subject, subsubject, interest, sendgrid_token], outputs=output) - token.change(fn=register_openai_token, inputs=[token]) - sample_btn.click(fn=sample, inputs=[email, subject, physics_subject, subsubject, interest], outputs=sample_output) - subject.change(fn=sample, inputs=[email, subject, physics_subject, subsubject, interest], outputs=sample_output) - physics_subject.change(fn=sample, inputs=[email, subject, physics_subject, subsubject, interest], outputs=sample_output) - subsubject.change(fn=sample, inputs=[email, subject, physics_subject, subsubject, interest], outputs=sample_output) - interest.submit(fn=sample, inputs=[email, subject, physics_subject, subsubject, interest], outputs=sample_output) - -demo.launch(show_api=False) diff --git a/src/app_new.py b/src/app_new.py new file mode 100755 index 0000000..08db449 --- /dev/null +++ b/src/app_new.py @@ -0,0 +1,1101 @@ +import gradio as gr +from download_new_papers import get_papers +import utils +from relevancy import generate_relevance_score, process_subject_fields + +import os +import openai +import datetime +import yaml +from paths import DATA_DIR, DIGEST_DIR +from model_manager import model_manager, 
ModelProvider +from gemini_utils import setup_gemini_api, get_topic_clustering + +# Load config file +def load_config(): + config_path = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), "config.yaml") + try: + with open(config_path, 'r') as file: + return yaml.safe_load(file) + except Exception as e: + print(f"Error loading config: {e}") + return {"threshold": 2} # Default threshold if config loading fails + +config = load_config() + +# Helper function to filter papers by threshold +def filter_papers_by_threshold(papers, threshold): + """Filter papers by relevancy score threshold""" + print(f"\n===== FILTERING PAPERS =====") + print(f"Only showing papers with relevancy score >= {threshold}") + print(f"(Change this value in config.yaml if needed)") + + # Debug the paper scores + for i, paper in enumerate(papers): + print(f"Paper {i+1} - Title: {paper.get('title', 'No title')}") + print(f" - Score: {paper.get('Relevancy score', 'No score')}") + print(f" - Fields: {list(paper.keys())}") + + # First extract data from gemini_analysis if it exists and Relevancy score doesn't + for paper in papers: + if "gemini_analysis" in paper and "Relevancy score" not in paper: + print(f"Extracting analysis data for paper: {paper.get('title')}") + gemini_data = paper["gemini_analysis"] + + # Map Gemini analysis fields to expected fields + field_mapping = { + "relevance_score": "Relevancy score", + "relationship_score": "Relevancy score", + "paper_relevance": "Relevancy score", + "paper's_relationship_to_the_user's_interests": "Relevancy score", + "key_innovations": "Key innovations", + "critical_analysis": "Critical analysis", + "methodology_summary": "Methodology", + "technical_significance": "Critical analysis", + "related_research": "Related work" + } + + # Copy fields using mapping + for gemini_field, expected_field in field_mapping.items(): + if gemini_field in gemini_data: + paper[expected_field] = gemini_data[gemini_field] + print(f" - Mapped 
{gemini_field} to {expected_field}") + + # If we have no score yet, look for a number in other fields + if "Relevancy score" not in paper: + # Try to find a relevance score in any field + for field, value in gemini_data.items(): + if isinstance(value, (int, float)) and 1 <= value <= 10: + paper["Relevancy score"] = value + print(f" - Found score {value} in field {field}") + break + elif isinstance(value, str) and "score" in field.lower(): + try: + # Try to extract a number from the string + import re + numbers = re.findall(r'\d+', value) + if numbers: + score = int(numbers[0]) + if 1 <= score <= 10: # Validate score range + paper["Relevancy score"] = score + print(f" - Extracted score {score} from {field}: {value}") + break + except: + pass + + # If still no score, default to threshold to include paper + if "Relevancy score" not in paper: + paper["Relevancy score"] = threshold + print(f" - Assigned default score {threshold}") + + # Add some reasonable defaults for missing fields + if "Reasons for match" not in paper and "topic_classification" in gemini_data: + paper["Reasons for match"] = gemini_data.get("topic_classification", "Not provided") + + # Set missing fields with default values + for field in ["Key innovations", "Critical analysis", "Goal", "Data", "Methodology", + "Implementation details", "Experiments & Results", "Discussion & Next steps", + "Related work", "Practical applications", "Key takeaways"]: + if field not in paper: + paper[field] = "Not available in analysis" + + # Ensure scores are properly parsed to integers + for paper in papers: + if "Relevancy score" in paper and not isinstance(paper["Relevancy score"], int): + try: + if isinstance(paper["Relevancy score"], str) and "/" in paper["Relevancy score"]: + paper["Relevancy score"] = int(paper["Relevancy score"].split("/")[0]) + else: + paper["Relevancy score"] = int(paper["Relevancy score"]) + except (ValueError, TypeError): + print(f"WARNING: Could not convert score '{paper.get('Relevancy 
score')}' to integer for paper: {paper.get('title')}") + paper["Relevancy score"] = threshold # Use threshold as default + + # Make sure all papers have required fields + required_fields = [ + "Relevancy score", "Reasons for match", "Key innovations", "Critical analysis", + "Goal", "Data", "Methodology", "Implementation details", "Experiments & Results", + "Git", "Discussion & Next steps", "Related work", "Practical applications", + "Key takeaways" + ] + + for paper in papers: + # Make sure it has a relevancy score + if "Relevancy score" not in paper: + paper["Relevancy score"] = threshold + print(f"Assigned default threshold score to paper: {paper.get('title')}") + + # Add missing fields with default values - always ensure all fields exist + for field in required_fields: + if field not in paper or paper[field] is None: + paper[field] = f"Not available in the paper content" + print(f"Added missing field {field} to paper: {paper.get('title')}") + elif isinstance(paper[field], str) and (not paper[field].strip() or + paper[field] == "Not provided" or paper[field] == "Not available in analysis"): + paper[field] = f"Not available in the paper content" + print(f"Replaced placeholder for field {field} in paper: {paper.get('title')}") + + # Now filter papers + filtered_papers = [p for p in papers if p.get("Relevancy score", 0) >= threshold] + print(f"After filtering: {len(filtered_papers)} papers remain out of {len(papers)}") + + # If fewer than 10 papers passed the filter, add the highest-scoring papers below threshold + # This ensures we always show a reasonable number of papers + if len(filtered_papers) < 10 and len(papers) > len(filtered_papers): + print(f"WARNING: Only {len(filtered_papers)} papers passed the threshold filter. 
Adding more papers.") + # Sort remaining papers by score and add the highest scoring ones + remaining_papers = [p for p in papers if p not in filtered_papers] + remaining_papers.sort(key=lambda x: x.get("Relevancy score", 0), reverse=True) + # Add enough papers to get to 10 or all remaining papers, whichever is less + additional_count = min(10 - len(filtered_papers), len(remaining_papers)) + filtered_papers.extend(remaining_papers[:additional_count]) + print(f"Added {additional_count} additional papers below threshold. Total papers: {len(filtered_papers)}") + + # Fallback if no papers passed the filter + if len(filtered_papers) == 0 and len(papers) > 0: + print("WARNING: No papers passed the threshold filter. Using all papers.") + filtered_papers = papers + + return filtered_papers +from design_automation import ( + is_design_automation_paper, + categorize_design_paper, + analyze_design_techniques, + extract_design_metrics, + get_related_design_papers, + create_design_analysis_prompt +) + +topics = { + "Physics": "", + "Mathematics": "math", + "Computer Science": "cs", + "Quantitative Biology": "q-bio", + "Quantitative Finance": "q-fin", + "Statistics": "stat", + "Electrical Engineering and Systems Science": "eess", + "Economics": "econ" +} + +physics_topics = { + "Astrophysics": "astro-ph", + "Condensed Matter": "cond-mat", + "General Relativity and Quantum Cosmology": "gr-qc", + "High Energy Physics - Experiment": "hep-ex", + "High Energy Physics - Lattice": "hep-lat", + "High Energy Physics - Phenomenology": "hep-ph", + "High Energy Physics - Theory": "hep-th", + "Mathematical Physics": "math-ph", + "Nonlinear Sciences": "nlin", + "Nuclear Experiment": "nucl-ex", + "Nuclear Theory": "nucl-th", + "Physics": "physics", + "Quantum Physics": "quant-ph" +} + +categories_map = { + "Astrophysics": ["Astrophysics of Galaxies", "Cosmology and Nongalactic Astrophysics", "Earth and Planetary Astrophysics", "High Energy Astrophysical Phenomena", "Instrumentation and Methods 
for Astrophysics", "Solar and Stellar Astrophysics"], + "Condensed Matter": ["Disordered Systems and Neural Networks", "Materials Science", "Mesoscale and Nanoscale Physics", "Other Condensed Matter", "Quantum Gases", "Soft Condensed Matter", "Statistical Mechanics", "Strongly Correlated Electrons", "Superconductivity"], + "General Relativity and Quantum Cosmology": ["None"], + "High Energy Physics - Experiment": ["None"], + "High Energy Physics - Lattice": ["None"], + "High Energy Physics - Phenomenology": ["None"], + "High Energy Physics - Theory": ["None"], + "Mathematical Physics": ["None"], + "Nonlinear Sciences": ["Adaptation and Self-Organizing Systems", "Cellular Automata and Lattice Gases", "Chaotic Dynamics", "Exactly Solvable and Integrable Systems", "Pattern Formation and Solitons"], + "Nuclear Experiment": ["None"], + "Nuclear Theory": ["None"], + "Physics": ["Accelerator Physics", "Applied Physics", "Atmospheric and Oceanic Physics", "Atomic and Molecular Clusters", "Atomic Physics", "Biological Physics", "Chemical Physics", "Classical Physics", "Computational Physics", "Data Analysis, Statistics and Probability", "Fluid Dynamics", "General Physics", "Geophysics", "History and Philosophy of Physics", "Instrumentation and Detectors", "Medical Physics", "Optics", "Physics and Society", "Physics Education", "Plasma Physics", "Popular Physics", "Space Physics"], + "Quantum Physics": ["None"], + "Mathematics": ["Algebraic Geometry", "Algebraic Topology", "Analysis of PDEs", "Category Theory", "Classical Analysis and ODEs", "Combinatorics", "Commutative Algebra", "Complex Variables", "Differential Geometry", "Dynamical Systems", "Functional Analysis", "General Mathematics", "General Topology", "Geometric Topology", "Group Theory", "History and Overview", "Information Theory", "K-Theory and Homology", "Logic", "Mathematical Physics", "Metric Geometry", "Number Theory", "Numerical Analysis", "Operator Algebras", "Optimization and Control", "Probability", 
"Quantum Algebra", "Representation Theory", "Rings and Algebras", "Spectral Theory", "Statistics Theory", "Symplectic Geometry"], + "Computer Science": ["Artificial Intelligence", "Computation and Language", "Computational Complexity", "Computational Engineering, Finance, and Science", "Computational Geometry", "Computer Science and Game Theory", "Computer Vision and Pattern Recognition", "Computers and Society", "Cryptography and Security", "Data Structures and Algorithms", "Databases", "Digital Libraries", "Discrete Mathematics", "Distributed, Parallel, and Cluster Computing", "Emerging Technologies", "Formal Languages and Automata Theory", "General Literature", "Graphics", "Hardware Architecture", "Human-Computer Interaction", "Information Retrieval", "Information Theory", "Logic in Computer Science", "Machine Learning", "Mathematical Software", "Multiagent Systems", "Multimedia", "Networking and Internet Architecture", "Neural and Evolutionary Computing", "Numerical Analysis", "Operating Systems", "Other Computer Science", "Performance", "Programming Languages", "Robotics", "Social and Information Networks", "Software Engineering", "Sound", "Symbolic Computation", "Systems and Control"], + "Quantitative Biology": ["Biomolecules", "Cell Behavior", "Genomics", "Molecular Networks", "Neurons and Cognition", "Other Quantitative Biology", "Populations and Evolution", "Quantitative Methods", "Subcellular Processes", "Tissues and Organs"], + "Quantitative Finance": ["Computational Finance", "Economics", "General Finance", "Mathematical Finance", "Portfolio Management", "Pricing of Securities", "Risk Management", "Statistical Finance", "Trading and Market Microstructure"], + "Statistics": ["Applications", "Computation", "Machine Learning", "Methodology", "Other Statistics", "Statistics Theory"], + "Electrical Engineering and Systems Science": ["Audio and Speech Processing", "Image and Video Processing", "Signal Processing", "Systems and Control"], + "Economics": 
["Econometrics", "General Economics", "Theoretical Economics"]
+}
+
+
+def generate_html_report(papers, title="ArXiv Digest Results", topic=None, category=None, query=None):
+    """Generate an HTML report for the papers and save to file.
+
+    Args:
+        papers: List of paper dictionaries
+        title: Title for the HTML report
+        topic: Optional topic name for filename
+        category: Optional category name for filename
+        query: Optional dictionary with interest field for research interests
+
+    Returns:
+        Path to the HTML file
+    """
+    # Debug: Log what fields are available in each paper
+    print(f"Generating HTML report for {len(papers)} papers")
+    for i, paper in enumerate(papers):
+        print(f"Paper {i+1} fields: {list(paper.keys())}")
+        if "Key innovations" in paper:
+            print(f"Paper {i+1} has Key innovations: {paper['Key innovations'][:50]}...")
+        if "Critical analysis" in paper:
+            print(f"Paper {i+1} has Critical analysis: {paper['Critical analysis'][:50]}...")
+
+    # Create a date for the filename (without time)
+    date = datetime.datetime.now().strftime("%Y%m%d")
+
+    # Create filename with topic if provided
+    if topic:
+        # Clean up topic name for filename (remove spaces, etc.)
+        topic_clean = topic.lower().replace(" ", "_").replace("/", "_")
+        html_file = os.path.join(DIGEST_DIR, f"arxiv_digest_{topic_clean}_{date}.html")
+    else:
+        html_file = os.path.join(DIGEST_DIR, f"arxiv_digest_{date}.html")
+
+    html = f"""
+
+
+
+
+ {title}
+
+
+
+

{title}

+
+

Generated on {datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")}

+

Found {len(papers)} papers

+

Topics: {topic or "All"}

+

Threshold: {(query or {}).get("threshold", config.get("threshold", "Not specified"))}

+
+ +
+

Research Interests:

+
{(query or {}).get('interest', 'Not specified')}
+
+ + +
+

Index

+ +
+ """ + + # Check if we have any papers + if not papers: + html += """ +
+
No papers found matching your criteria
+
+

No papers met the relevancy threshold criteria. You can:

+ +
+
+        """.format(threshold=config.get("threshold", 2))
+
+    # Add papers
+    for i, paper in enumerate(papers):
+        paper_id = f"paper-{i}"
+        html += f"""
+
+
{paper.get("title", "No title")}
+
{paper.get("authors", "Unknown authors")}
+
Subject: {paper.get("subjects", "N/A")}
+ """ + + # Add relevancy score and reasons if available + if "Relevancy score" in paper: + html += f'
Relevancy Score: {paper.get("Relevancy score", "N/A")}
' + + if "Reasons for match" in paper: + html += f'
Reason for Relevance: {paper.get("Reasons for match", "")}
' + + # Add design information if available + if "design_category" in paper or "design_techniques" in paper: + html += '
' + if "design_category" in paper: + html += f'
Design Category: {paper.get("design_category", "")}
' + if "design_techniques" in paper: + html += f'
Design Techniques: {", ".join(paper.get("design_techniques", []))}
' + html += '
' + + # Add abstract + if "abstract" in paper: + html += f'
Abstract: {paper.get("abstract", "")}
' + + # Helper function to format field content properly + def format_field_content(content): + if isinstance(content, list): + # Format list items with bullet points + return '' + else: + return content + + # Add key innovations and critical analysis with special styling + if "Key innovations" in paper: + formatted_innovations = format_field_content(paper.get("Key innovations", "")) + html += f'
Key Innovations:
{formatted_innovations}
' + print(f"Added Key innovations for paper {i+1}") + else: + print(f"Paper {i+1} is missing Key innovations field") + + if "Critical analysis" in paper: + formatted_analysis = format_field_content(paper.get("Critical analysis", "")) + html += f'
Critical Analysis:
{formatted_analysis}
' + print(f"Added Critical analysis for paper {i+1}") + else: + print(f"Paper {i+1} is missing Critical analysis field") + + # Add goal + if "Goal" in paper: + formatted_goal = format_field_content(paper.get("Goal", "")) + html += f'
Goal:
{formatted_goal}
' + print(f"Added Goal for paper {i+1}") + else: + print(f"Paper {i+1} is missing Goal field") + + # Add Data + if "Data" in paper: + formatted_data = format_field_content(paper.get("Data", "")) + html += f'
Data:
{formatted_data}
' + print(f"Added Data for paper {i+1}") + else: + print(f"Paper {i+1} is missing Data field") + + # Add Methodology + if "Methodology" in paper: + formatted_methodology = format_field_content(paper.get("Methodology", "")) + html += f'
Methodology:
{formatted_methodology}
' + print(f"Added Methodology for paper {i+1}") + else: + print(f"Paper {i+1} is missing Methodology field") + + # Add implementation details + if "Implementation details" in paper: + formatted_details = format_field_content(paper.get("Implementation details", "")) + html += f'
Implementation Details:
{formatted_details}
' + print(f"Added Implementation details for paper {i+1}") + else: + print(f"Paper {i+1} is missing Implementation details field") + + # Add experiments and results + if "Experiments & Results" in paper: + formatted_results = format_field_content(paper.get("Experiments & Results", "")) + html += f'
Experiments & Results:
{formatted_results}
' + print(f"Added Experiments & Results for paper {i+1}") + else: + print(f"Paper {i+1} is missing Experiments & Results field") + + # Add discussion and next steps + if "Discussion & Next steps" in paper: + formatted_discussion = format_field_content(paper.get("Discussion & Next steps", "")) + html += f'
Discussion & Next Steps:
{formatted_discussion}
' + print(f"Added Discussion & Next steps for paper {i+1}") + else: + print(f"Paper {i+1} is missing Discussion & Next steps field") + + # Add Related work + if "Related work" in paper: + formatted_related = format_field_content(paper.get("Related work", "")) + html += f'
Related Work:
{formatted_related}
' + print(f"Added Related work for paper {i+1}") + else: + print(f"Paper {i+1} is missing Related work field") + + # Add Practical applications + if "Practical applications" in paper: + formatted_applications = format_field_content(paper.get("Practical applications", "")) + html += f'
Practical Applications:
{formatted_applications}
' + print(f"Added Practical applications for paper {i+1}") + else: + print(f"Paper {i+1} is missing Practical applications field") + + # Add Key takeaways + if "Key takeaways" in paper: + formatted_takeaways = format_field_content(paper.get("Key takeaways", "")) + html += f'
Key Takeaways:
{formatted_takeaways}
' + print(f"Added Key takeaways for paper {i+1}") + else: + print(f"Paper {i+1} is missing Key takeaways field") + + # Add remaining sections that weren't already handled specifically above + for key, value in paper.items(): + # Skip fields we've already handled or don't want to display + if key in ["title", "authors", "subjects", "main_page", "Relevancy score", "Reasons for match", + "design_category", "design_techniques", "summarized_text", "abstract", "content", + "Key innovations", "Critical analysis", "Goal", "Data", "Methodology", + "Implementation details", "Experiments & Results", "Discussion & Next steps", + "Related work", "Practical applications", "Key takeaways"]: + continue + + if isinstance(value, str) and value.strip(): + # Choose appropriate styling based on the section + section_class = "section" + + if "goal" in key.lower() or "aim" in key.lower(): + section_class = "key-section" + elif "data" in key.lower() or "methodology" in key.lower(): + section_class = "implementation" + elif "related" in key.lower() or "practical" in key.lower() or "takeaway" in key.lower(): + section_class = "discussion" + + formatted_value = format_field_content(value) + html += f'
{key}:
{formatted_value}
' + print(f"Added additional field {key} for paper {i+1}") + + # Add links + html += f""" + +
+ """ + + html += """ + + + + """ + + # Save the HTML file + with open(html_file, "w") as f: + f.write(html) + + print(f"Saved HTML report to {html_file}") + return html_file + +def sample(email, topic, physics_topic, categories, interest, use_openai, use_gemini, use_anthropic, + openai_model, gemini_model, anthropic_model, special_analysis, threshold_from_ui, custom_batch_size, custom_batch_number, + custom_prompt_batch_size, mechanistic_interpretability, technical_ai_safety, + design_automation, design_reference_paper, design_techniques, design_categories): + print(f"\n===== STARTING TWO-STAGE PAPER ANALYSIS =====") + print(f"Topic: {topic}") + print(f"Research interests: {interest[:100]}...") + print(f"Using threshold: {threshold_from_ui}") + print(f"Stage 1 (Filtering): OpenAI {openai_model}") + print(f"Stage 2 (Analysis): {'Gemini ' + gemini_model if use_gemini else 'OpenAI ' + openai_model}") + print(f"UI Batch size: {custom_batch_size} papers") + print(f"Prompt batch size: {custom_prompt_batch_size} papers per prompt") + print(f"===============================================") + if not topic: + raise gr.Error("You must choose a topic.") + if topic == "Physics": + if isinstance(physics_topic, list): + raise gr.Error("You must choose a physics topic.") + topic = physics_topic + abbr = physics_topics[topic] + else: + abbr = topics[topic] + + # Check if at least one model is selected + if not (use_openai or use_gemini or use_anthropic): + raise gr.Error("You must select at least one model provider (OpenAI, Gemini, or Claude)") + + # Get papers based on categories + if categories: + all_papers = get_papers(abbr) + all_papers = [ + t for t in all_papers + if bool(set(process_subject_fields(t['subjects'])) & set(categories))] + print(f"Found {len(all_papers)} papers matching categories: {categories}") + else: + all_papers = get_papers(abbr) + print(f"Found {len(all_papers)} papers for topic: {topic}") + + # Always process all papers + papers = all_papers + 
total_papers = len(all_papers) + print(f"Processing all {total_papers} papers") + + # Fixed parameters: + # - Stage 1: 8 papers per batch for relevancy scoring (title & abstract only) + # - Stage 2: Detailed analysis of papers that meet threshold + # - Minimum 10 papers (will include top-scoring papers below threshold if needed) + + if interest: + # Build list of providers to use + providers = [] + model_names = {} + + if use_openai: + if not model_manager.is_provider_available(ModelProvider.OPENAI): + if not openai.api_key: + raise gr.Error("Set your OpenAI API key in the OpenAI tab first") + else: + model_manager.register_openai(openai.api_key) + providers.append(ModelProvider.OPENAI) + model_names[ModelProvider.OPENAI] = openai_model + + if use_gemini: + if not model_manager.is_provider_available(ModelProvider.GEMINI): + raise gr.Error("Set your Gemini API key in the Gemini tab first") + providers.append(ModelProvider.GEMINI) + model_names[ModelProvider.GEMINI] = gemini_model + + if use_anthropic: + if not model_manager.is_provider_available(ModelProvider.ANTHROPIC): + raise gr.Error("Set your Anthropic API key in the Anthropic tab first") + providers.append(ModelProvider.ANTHROPIC) + model_names[ModelProvider.ANTHROPIC] = anthropic_model + + # Check if we need to find design automation papers + if design_automation: + # Filter for design automation papers + design_papers = [p for p in papers if is_design_automation_paper(p)] + + # Filter by techniques if specified + if design_techniques: + filtered_papers = [] + for paper in design_papers: + paper_techniques = analyze_design_techniques(paper) + if any(technique in design_techniques for technique in paper_techniques): + filtered_papers.append(paper) + design_papers = filtered_papers if filtered_papers else design_papers + + # Filter by categories if specified + if design_categories: + filtered_papers = [] + for paper in design_papers: + paper_category = categorize_design_paper(paper) + if any(category in 
paper_category for category in design_categories): + filtered_papers.append(paper) + design_papers = filtered_papers if filtered_papers else design_papers + + # Find related papers if reference paper is specified + if design_reference_paper: + related_papers = get_related_design_papers(design_reference_paper, papers) + if related_papers: + design_papers = related_papers + + # Use these papers if we found any, otherwise fallback to regular papers + if design_papers: + papers = design_papers + + # Process papers directly instead of using model_manager + print("\n===== ANALYZING PAPERS FOR EMAIL =====") + print(f"Processing {len(papers)} papers...") + relevancy = [] + hallucination = False + + # Use OpenAI if selected + if use_openai and model_manager.is_provider_available(ModelProvider.OPENAI): + try: + # Import directly to avoid circular imports + from relevancy import generate_relevance_score + openai_results, hallu = generate_relevance_score( + papers, + query={"interest": interest}, + model_name=openai_model, + threshold_score=int(threshold_from_ui), # Apply threshold from UI slider + num_paper_in_prompt=int(custom_prompt_batch_size), # Use the user-specified prompt batch size + stage2_model=gemini_model if use_gemini else "gpt-4-turbo" # Use Gemini for stage 2 if selected + ) + hallucination = hallucination or hallu + relevancy.extend(openai_results) + print(f"OpenAI analysis added {len(openai_results)} papers") + except Exception as e: + print(f"Error during OpenAI analysis: {e}") + + # Use Gemini if selected and no papers yet + if use_gemini and model_manager.is_provider_available(ModelProvider.GEMINI) and len(relevancy) == 0: + try: + # Import directly to avoid circular imports + from gemini_utils import analyze_papers_with_gemini + gemini_papers = analyze_papers_with_gemini( + papers, + query={"interest": interest}, + model_name=gemini_model + ) + # Process papers to ensure they have the right fields + for paper in gemini_papers: + if 'gemini_analysis' in 
paper: + # Copy all fields from gemini_analysis to the paper object + for key, value in paper['gemini_analysis'].items(): + paper[key] = value + + relevancy.extend(gemini_papers) + print(f"Gemini analysis added {len(gemini_papers)} papers") + except Exception as e: + print(f"Error during Gemini analysis: {e}") + + # Use Anthropic if selected and no papers yet + if use_anthropic and model_manager.is_provider_available(ModelProvider.ANTHROPIC) and len(relevancy) == 0: + try: + # Import directly to avoid circular imports + from anthropic_utils import analyze_papers_with_claude + claude_papers = analyze_papers_with_claude( + papers, + query={"interest": interest}, + model_name=anthropic_model + ) + # Process papers to ensure they have the right fields + for paper in claude_papers: + if 'claude_analysis' in paper: + # Copy all fields from claude_analysis to the paper object + for key, value in paper['claude_analysis'].items(): + paper[key] = value + + relevancy.extend(claude_papers) + print(f"Claude analysis added {len(claude_papers)} papers") + except Exception as e: + print(f"Error during Claude analysis: {e}") + + print(f"Total papers after analysis: {len(relevancy)}") + + # Papers are already filtered by threshold during LLM response processing + # This is now just a safety check to ensure we didn't miss any + threshold_value = int(threshold_from_ui) if threshold_from_ui is not None else config.get("threshold", 2) + print(f"Using relevancy threshold: {threshold_value}") + print(f"Papers before final threshold check: {len(relevancy)}") + relevancy = filter_papers_by_threshold(relevancy, threshold_value) + print(f"Papers after final threshold check: {len(relevancy)}") + + # Add design automation information if requested + if design_automation and relevancy: + for paper in relevancy: + paper["design_category"] = categorize_design_paper(paper) + paper["design_techniques"] = analyze_design_techniques(paper) + paper["design_metrics"] = extract_design_metrics(paper) + + # 
Perform detailed design automation analysis on highest scored papers + if paper.get("Relevancy score", 0) >= 7 and (use_openai or use_gemini or use_anthropic): + # Select provider for design analysis + provider = None + model = None + + if use_openai and model_manager.is_provider_available(ModelProvider.OPENAI): + provider = ModelProvider.OPENAI + model = openai_model + elif use_gemini and model_manager.is_provider_available(ModelProvider.GEMINI): + provider = ModelProvider.GEMINI + model = gemini_model + elif use_anthropic and model_manager.is_provider_available(ModelProvider.ANTHROPIC): + provider = ModelProvider.ANTHROPIC + model = anthropic_model + + if provider: + design_analysis = model_manager.analyze_design_automation( + paper, + provider=provider, + model_name=model + ) + if design_analysis and "error" not in design_analysis: + paper["design_analysis"] = design_analysis + + # Add specialized analysis if requested + if special_analysis and len(relevancy) > 0: + # Get topic clustering from Gemini if available + if use_gemini and model_manager.is_provider_available(ModelProvider.GEMINI): + try: + clusters = get_topic_clustering(relevancy, model_name=gemini_model) + cluster_info = "\n\n=== TOPIC CLUSTERS ===\n" + for i, cluster in enumerate(clusters.get("clusters", [])): + cluster_info += f"\nCluster {i+1}: {cluster.get('name')}\n" + cluster_info += f"Papers: {', '.join([str(p) for p in cluster.get('papers', [])])}\n" + cluster_info += f"Description: {cluster.get('description')}\n" + + # Add cluster info to the output + cluster_summary = "\n\n" + cluster_info + "\n\n" + except Exception as e: + cluster_summary = f"\n\nError generating clusters: {str(e)}\n\n" + else: + cluster_summary = "" + + # Add specialized mechanistic interpretability analysis if requested + if mechanistic_interpretability and len(relevancy) > 0: + # Use the first available provider in order of preference + preferred_providers = [ + (ModelProvider.ANTHROPIC, anthropic_model if 
use_anthropic else None), + (ModelProvider.OPENAI, openai_model if use_openai else None), + (ModelProvider.GEMINI, gemini_model if use_gemini else None) + ] + + provider = None + model = None + for p, m in preferred_providers: + if model_manager.is_provider_available(p) and m: + provider = p + model = m + break + + if provider: + try: + interp_analysis = model_manager.get_mechanistic_interpretability_analysis( + relevancy[0], # Analyze the most relevant paper + provider=provider, + model_name=model + ) + + interp_summary = "\n\n=== MECHANISTIC INTERPRETABILITY ANALYSIS ===\n" + for key, value in interp_analysis.items(): + if key != "error" and key != "raw_content": + interp_summary += f"\n{key}: {value}\n" + + # Add interpretability analysis to the output + interpretability_info = "\n\n" + interp_summary + "\n\n" + except Exception as e: + interpretability_info = f"\n\nError generating interpretability analysis: {str(e)}\n\n" + else: + interpretability_info = "\n\nNo available provider for interpretability analysis.\n\n" + else: + interpretability_info = "" + + # Generate HTML report + html_file = generate_html_report( + relevancy, + title=f"ArXiv Digest: {topic} papers", + topic=topic, + query={"interest": interest, "threshold": threshold_value} + ) + + # Create summary texts for display + summary_texts = [] + for paper in relevancy: + if "summarized_text" in paper: + summary_texts.append(paper["summarized_text"]) + else: + # Create a summary if summarized_text doesn't exist + summary = f"Title: {paper.get('title', 'No title')}\n" + summary += f"Authors: {paper.get('authors', 'Unknown')}\n" + summary += f"Score: {paper.get('Relevancy score', 'N/A')}\n" + summary += f"Abstract: {paper.get('abstract', 'No abstract')[:200]}...\n" + summary_texts.append(summary) + + result_text = cluster_summary + "\n\n".join(summary_texts) + interpretability_info + return result_text + f"\n\nHTML report saved to: {html_file}" + else: + # Generate HTML report + html_file = 
generate_html_report( + relevancy, + title=f"ArXiv Digest: {topic} papers", + topic=topic, + query={"interest": interest, "threshold": threshold_value} + ) + + # Create summary texts for display + summary_texts = [] + for paper in relevancy: + if "summarized_text" in paper: + summary_texts.append(paper["summarized_text"]) + else: + # Create a summary if summarized_text doesn't exist + summary = f"Title: {paper.get('title', 'No title')}\n" + summary += f"Authors: {paper.get('authors', 'Unknown')}\n" + summary += f"Score: {paper.get('Relevancy score', 'N/A')}\n" + summary += f"Abstract: {paper.get('abstract', 'No abstract')[:200]}...\n" + summary_texts.append(summary) + + result_text = "\n\n".join(summary_texts) + return result_text + f"\n\nHTML report saved to: {html_file}" + else: + # Generate HTML report for basic results + html_file = generate_html_report( + papers, + title=f"ArXiv Digest: {topic} papers", + topic=topic, + query={"interest": interest, "threshold": threshold_from_ui if "threshold_from_ui" in locals() else config.get("threshold", 2)} + ) + result_text = "\n\n".join(f"Title: {paper['title']}\nAuthors: {paper['authors']}" for paper in papers) + return result_text + f"\n\nHTML report saved to: {html_file}" + + +def change_subsubject(subject, physics_subject): + # For any subject (not just Physics), show appropriate subtopics + if subject == "Physics" and physics_subject and not isinstance(physics_subject, list): + # For Physics, show subcategories based on selected physics category + return {"choices": categories_map[physics_subject], "value": [], "visible": True} + elif subject in categories_map: + # For other main topics, show their subtopics directly + return {"choices": categories_map[subject], "value": [], "visible": True} + else: + # If no subtopics available + return {"choices": [], "value": [], "visible": False} + + +def change_physics(subject): + # Always return just the visibility attribute to avoid errors with value updates + if subject != 
"Physics": + return {"visible": False} + else: + return {"visible": True} + + +def register_openai_token(token): + openai.api_key = token + model_manager.register_openai(token) + +def register_gemini_token(token): + setup_gemini_api(token) + model_manager.register_gemini(token) + +def register_anthropic_token(token): + model_manager.register_anthropic(token) + +# Custom CSS +custom_css = """ +#main-title h1 { + text-align: center; + font-size: 2.5rem; + font-weight: bold; + color: #2c3e50; + margin-bottom: 0.5rem; +} + +.banner-image { + display: flex; + justify-content: center; + margin: 0 auto; +} +""" + +with gr.Blocks(css=custom_css) as demo: + with gr.Column(): + # Title first, then banner image + with gr.Row(): + gr.Markdown(""" + # ArXiv Digest + """, elem_id="main-title") + + # Smaller banner image, centered + with gr.Row(elem_classes="banner-image"): + gr.Image(value="./readme_images/main_banner.png", show_label=False, width=250) + + with gr.Row(): + gr.Markdown(""" + **Personalized arXiv Paper Recommendations with LLMs** + + This app helps you discover relevant academic papers from arXiv based on your research interests. + It uses a two-stage processing system: first filtering papers for relevance, then analyzing them in depth. 
+ + [GitHub Repository](https://github.com/linhkid/ArxivDigest-extra) • [Report an Issue](https://github.com/linhkid/ArxivDigest-extra/issues) + """) + + with gr.Tabs(): + with gr.TabItem("OpenAI"): + openai_token = gr.Textbox(label="OpenAI API Key", type="password") + openai_token.change(fn=register_openai_token, inputs=[openai_token]) + + with gr.TabItem("Gemini"): + gemini_token = gr.Textbox(label="Gemini API Key", type="password") + gemini_token.change(fn=register_gemini_token, inputs=[gemini_token]) + + with gr.TabItem("Anthropic"): + anthropic_token = gr.Textbox(label="Anthropic API Key", type="password") + anthropic_token.change(fn=register_anthropic_token, inputs=[anthropic_token]) + + subject = gr.Radio( + list(topics.keys()), label="Topic" + ) + # Only show physics dropdown when Physics is selected + physics_subject = gr.Dropdown(list(physics_topics.keys()), value=list(physics_topics.keys())[0], + multiselect=False, label="Physics category", visible=False) + subsubject = gr.Dropdown( + [], value=[], multiselect=True, + label="Subtopic (optional)", info="Optional. Leaving it empty will use all subtopics.", visible=True) + + # Use interest from config.yaml as default value + interest = gr.Textbox( + label="A natural language description of what you are interested in.
We will generate relevancy scores (1-10) and explanations for the papers in the selected topics according to this statement.", + info="Press shift-enter or click the button below to update.", + lines=7, + value=config.get("interest", "") + ) + + with gr.Row(): + use_openai = gr.Checkbox(label="Use OpenAI", value=True) + use_gemini = gr.Checkbox(label="Use Gemini", value=False) + use_anthropic = gr.Checkbox(label="Use Claude", value=False) + + with gr.Accordion("Advanced Settings", open=False): + openai_model = gr.Dropdown(["gpt-3.5-turbo-16k", "gpt-4", "gpt-4-turbo", "gpt-4o", "gpt-4o-mini"], value="gpt-4", label="OpenAI Model") + gemini_model = gr.Dropdown(["gemini-1.5-flash", "gemini-1.5-pro", "gemini-2.0-flash"], value="gemini-2.0-flash", label="Gemini Model") + anthropic_model = gr.Dropdown(["claude-3-haiku-20240307", "claude-3-sonnet-20240229", "claude-3-opus-20240229", "claude-3-5-sonnet-20240620"], value="claude-3-sonnet-20240229", label="Claude Model") + + # Always include specialized analysis by default + special_analysis = gr.Checkbox(label="Include specialized analysis for research topics", value=True) + + # Add threshold slider for relevancy filtering + threshold = gr.Slider( + minimum=0, + maximum=10, + value=config.get("threshold", 2), + step=1, + label="Relevancy Score Threshold", + info="Papers with scores below this value will be filtered out (default from config.yaml: " + str(config.get("threshold", 2)) + ")" + ) + + # Hidden fields with fixed defaults (not shown in UI) + batch_size = gr.Number(value=0, visible=False) # 0 = process all + batch_number = gr.Number(value=1, visible=False) + prompt_batch_size = gr.Number(value=8, visible=False) # Fixed at 8 papers per prompt + + # Multi-stage processing info + gr.Markdown(""" + ### Two-Stage Paper Processing + 1. **Stage 1**: OpenAI performs quick relevancy filtering based on title & abstract only + 2.
**Stage 2**: Gemini (if selected) performs detailed analysis on papers that passed threshold + """) + + # Add two-stage processing explanation + gr.Markdown(""" + **Efficiency Benefits**: + - Filtering happens before downloading full content + - Only relevant papers get detailed analysis + - Optimizes token usage and response time + """, visible=True) + + # Hidden fields for mechanistic interpretability and technical AI safety (not shown in UI but needed for function calls) + mechanistic_interpretability = gr.Checkbox(label="Include mechanistic interpretability analysis", value=False, visible=False) + technical_ai_safety = gr.Checkbox(label="Include technical AI safety analysis", value=False, visible=False) + + # Hidden fields for design automation (not shown in UI but needed for function calls) + design_automation = gr.Checkbox(label="Find graphic design automation papers", value=False, visible=False) + design_reference_paper = gr.Textbox( + label="Reference paper ID", + value="", + visible=False + ) + design_techniques = gr.CheckboxGroup( + choices=[], + value=[], + visible=False + ) + design_categories = gr.CheckboxGroup( + choices=[], + value=[], + visible=False + ) + + # Hidden fields for email (not shown in UI but needed for function calls) + email = gr.Textbox(label="Email address", type="email", placeholder="", visible=False) + sendgrid_token = gr.Textbox(label="SendGrid API Key", type="password", visible=False) + + with gr.Row(): + sample_btn = gr.Button("Generate Digest", variant="primary", scale=2) + + with gr.Row(): + sample_output = gr.Textbox( + label="Results", + info="Papers are first filtered by relevance score, then analyzed in depth. 
HTML reports are saved to the 'digest' folder.", + show_label=True + ) + + # Define all input fields + all_inputs = [ + email, subject, physics_subject, subsubject, interest, + use_openai, use_gemini, use_anthropic, + openai_model, gemini_model, anthropic_model, + special_analysis, threshold, batch_size, batch_number, prompt_batch_size, + mechanistic_interpretability, technical_ai_safety, + design_automation, design_reference_paper, design_techniques, design_categories + ] + + # Connect change handlers for dynamic UI - use cleaner event handling + def on_topic_change(topic): + visible = (topic == "Physics") + return { + physics_subject: gr.update(visible=visible), + subsubject: gr.update(choices=categories_map.get(topic, []), visible=topic in categories_map) + } + + subject.change(fn=on_topic_change, inputs=[subject], outputs=[physics_subject, subsubject]) + + # Use simpler event handler for physics subtopic changing + def on_physics_change(topic, physics_topic): + if topic == "Physics" and physics_topic and physics_topic in categories_map: + return gr.update(choices=categories_map[physics_topic], visible=True) + return gr.update(visible=False) + + physics_subject.change(fn=on_physics_change, inputs=[subject, physics_subject], outputs=[subsubject]) + + # Sample button + sample_btn.click( + fn=sample, + inputs=all_inputs, + outputs=sample_output + ) + + # Register API keys + openai_token.change(fn=register_openai_token, inputs=[openai_token]) + gemini_token.change(fn=register_gemini_token, inputs=[gemini_token]) + anthropic_token.change(fn=register_anthropic_token, inputs=[anthropic_token]) + + # Only allow updates when the button is clicked or interest is submitted directly + interest.submit(fn=sample, inputs=all_inputs, outputs=sample_output) + +demo.launch(show_api=False) diff --git a/src/design/README.md b/src/design/README.md new file mode 100644 index 0000000..f90399b --- /dev/null +++ b/src/design/README.md @@ -0,0 +1,82 @@ +# 🎨 Design Paper Discovery + +This 
module specializes in finding and analyzing papers related to AI/ML for design automation. It crawls arXiv for design-related papers and provides detailed reports on recent research at the intersection of AI and design. + +## Features + +- **Smart Paper Finding**: Automatically finds papers related to design automation and creative AI +- **Multi-Category Search**: Searches across Computer Vision, Graphics, HCI, and other relevant arXiv categories +- **Intelligent Categorization**: Sorts papers into design subcategories (UI/UX, Layout, Graphic Design, etc.) +- **Technique Analysis**: Identifies AI techniques used (GANs, Diffusion Models, LLMs, etc.) +- **LLM-Powered Analysis**: Optional in-depth analysis using OpenAI, Gemini, or Claude models +- **HTML Reports**: Generates clean, organized HTML reports with paper statistics and details +- **JSON Export**: Saves all paper data in structured JSON format for further processing + +## Quick Start + +Run the main script from the project root directory: + +```bash +# Basic usage - find design papers from the last 7 days +./src/design/find_design_papers.sh + +# With keyword filtering - find design papers about layout generation +./src/design/find_design_papers.sh --keyword "layout" + +# With longer timeframe - find design papers from the last month +./src/design/find_design_papers.sh --days 30 +``` + +## Advanced Usage + +```bash +# With LLM analysis for comprehensive paper details +./src/design/find_design_papers.sh --analyze + +# Customize research interests for analysis +./src/design/find_design_papers.sh --analyze --interest "I'm looking for papers on UI/UX automation and layout generation with neural networks" + +# Change the model used for analysis +./src/design/find_design_papers.sh --analyze --model "gpt-4o" + +# Combined example with all major features +./src/design/find_design_papers.sh --days 14 --keyword "diffusion" --analyze --model "gpt-4o" --interest "I'm researching diffusion models for design applications" 
+ +# Output files include the current date by default: +# - data/design_papers_diffusion_20250406.json +# - digest/design_papers_diffusion_20250406.html + +# Disable date in filenames if needed +./src/design/find_design_papers.sh --keyword "layout" --no-date +``` + +## Parameters Reference + +| Parameter | Description | Default | +|-----------|-------------|---------| +| `--days N` | Number of days to search back | 7 | +| `--keyword TERM` | Filter papers containing this keyword | none | +| `--analyze` | Use LLM to perform detailed analysis | false | +| `--interest "TEXT"` | Custom research interest for LLM | Design automation focus | +| `--model MODEL` | Model to use for analysis | gpt-3.5-turbo-16k | +| `--no-date` | Don't add date to output filenames | false | +| `--output FILE` | Custom JSON output path | data/design_papers_DATE.json | +| `--html FILE` | Custom HTML output path | digest/design_papers_DATE.html | +| `--help` | Show help message | | + +## Implementation Details + +The design paper discovery consists of these main components: + +1. **find_design_papers.sh**: Main shell script interface with help and options +2. **find_design_papers.py**: Core Python implementation for arXiv discovery and analysis +3. **design_finder.py**: Alternative implementation with minimal dependencies +4. **get_design_papers.sh**: Legacy script (maintained for backward compatibility) + +## Example Output + +The HTML report includes: +- Summary statistics and paper counts by category and technique +- Detailed paper listings with titles, authors, and abstracts +- AI analysis sections when using the `--analyze` flag +- Links to arXiv pages and PDF downloads diff --git a/src/design/design_finder.py b/src/design/design_finder.py new file mode 100755 index 0000000..196d142 --- /dev/null +++ b/src/design/design_finder.py @@ -0,0 +1,439 @@ +#!/usr/bin/env python3 +""" +Design Finder - A self-contained script to find AI/ML design automation papers on arXiv. 
+ +This script requires only Python standard libraries and BeautifulSoup, making it very easy to run +without complex dependencies. + +Usage: + python design_finder.py [--days 7] [--output design_papers.json] +""" + +import os +import sys +import json +import argparse +import datetime +import re +import time +import urllib.request +from typing import List, Dict, Any + +# Check for BeautifulSoup +try: + from bs4 import BeautifulSoup as bs +except ImportError: + print("BeautifulSoup not found. Installing...") + import subprocess + subprocess.check_call([sys.executable, "-m", "pip", "install", "beautifulsoup4"]) + from bs4 import BeautifulSoup as bs + +# Default arXiv categories to search +DEFAULT_CATEGORIES = [ + "cs.CV", # Computer Vision + "cs.GR", # Graphics + "cs.HC", # Human-Computer Interaction + "cs.AI", # Artificial Intelligence + "cs.LG", # Machine Learning + "cs.CL", # Computation and Language (NLP) + "cs.MM" # Multimedia +] + +# Design automation keywords for paper filtering +DESIGN_AUTOMATION_KEYWORDS = [ + "design automation", "layout generation", "visual design", "graphic design", + "creative AI", "generative design", "UI generation", "UX automation", + "design system", "composition", "creative workflow", "automated design", + "design tool", "design assistant", "design optimization", "content-aware", + "user interface generation", "visual layout", "image composition", "AI design" +] + +class DesignPaperFinder: + def __init__(self, days_back=7, categories=None, output_file="design_papers.json", + html_file="design_papers.html", keyword=None, verbose=True): + self.days_back = days_back + self.categories = categories or DEFAULT_CATEGORIES + self.output_file = output_file + self.html_file = html_file + self.keyword = keyword + self.verbose = verbose + self.papers = [] + + # Data directory is already created by paths.py module + + def log(self, message): + """Print a message if verbose mode is enabled.""" + if self.verbose: + print(message) + + def 
get_date_range(self) -> List[str]: + """Get list of dates to search in arXiv format.""" + today = datetime.datetime.now() + dates = [] + + for i in range(self.days_back): + date = today - datetime.timedelta(days=i) + date_str = date.strftime("%a, %d %b %y") + dates.append(date_str) + + return dates + + def download_papers(self, category: str, date_str: str) -> List[Dict[str, Any]]: + """Download papers for a specific category and date.""" + # Check if we already have this data + # Import data directory at runtime to avoid circular imports + import sys + sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) + from paths import DATA_DIR + file_path = os.path.join(DATA_DIR, f"{category}_{date_str}.jsonl") + if os.path.exists(file_path): + self.log(f"Loading cached papers for {category} on {date_str}") + papers = [] + with open(file_path, "r") as f: + for line in f: + papers.append(json.loads(line)) + return papers + + # Download new papers + self.log(f"Downloading papers for {category} on {date_str}") + NEW_SUB_URL = f'https://arxiv.org/list/{category}/new' + + try: + page = urllib.request.urlopen(NEW_SUB_URL) + except Exception as e: + self.log(f"Error downloading from {NEW_SUB_URL}: {e}") + return [] + + soup = bs(page, 'html.parser') + content = soup.body.find("div", {'id': 'content'}) + + # Find the date heading + h3 = content.find("h3").text # e.g: New submissions for Wed, 10 May 23 + date_from_page = h3.replace("New submissions for", "").strip() + + # Find all papers + dt_list = content.dl.find_all("dt") + dd_list = content.dl.find_all("dd") + arxiv_base = "https://arxiv.org/abs/" + arxiv_html = "https://arxiv.org/html/" + + papers = [] + for i in range(len(dt_list)): + try: + paper = {} + ahref = dt_list[i].find('a', href=re.compile(r'[/]([a-z]|[A-Z])\w+')).attrs['href'] + paper_number = ahref.strip().replace("/abs/", "") + + paper['main_page'] = arxiv_base + paper_number + paper['pdf'] = arxiv_base.replace('abs', 'pdf') + paper_number + 
+ paper['title'] = dd_list[i].find("div", {"class": "list-title mathjax"}).text.replace("Title:\n", "").strip() + paper['authors'] = dd_list[i].find("div", {"class": "list-authors"}).text.replace("Authors:\n", "").replace("\n", "").strip() + paper['subjects'] = dd_list[i].find("div", {"class": "list-subjects"}).text.replace("Subjects:\n", "").strip() + paper['abstract'] = dd_list[i].find("p", {"class": "mathjax"}).text.replace("\n", " ").strip() + + # Get a short excerpt of content (optional) + try: + html = urllib.request.urlopen(arxiv_html + paper_number + "v1") + soup_content = bs(html, 'html.parser') + content_div = soup_content.find('div', attrs={'class': 'ltx_page_content'}) + if content_div: + para_list = content_div.find_all("div", attrs={'class': 'ltx_para'}) + excerpt = ' '.join([p.text.strip() for p in para_list[:3]]) # Get first 3 paragraphs + paper['content_excerpt'] = excerpt[:1000] + "..." if len(excerpt) > 1000 else excerpt + else: + paper['content_excerpt'] = "Content not available" + except Exception: + paper['content_excerpt'] = "" + + papers.append(paper) + except Exception as e: + if self.verbose: + self.log(f"Error processing paper {i}: {e}") + + # Save papers to file + with open(file_path, "w") as f: + for paper in papers: + f.write(json.dumps(paper) + "\n") + + return papers + + def is_design_automation_paper(self, paper: Dict[str, Any]) -> bool: + """Check if a paper is related to design automation based on keywords.""" + text = ( + (paper.get("title", "") + " " + + paper.get("abstract", "") + " " + + paper.get("subjects", "")).lower() + ) + + return any(keyword.lower() in text for keyword in DESIGN_AUTOMATION_KEYWORDS) + + def categorize_design_paper(self, paper: Dict[str, Any]) -> str: + """Categorize design automation paper into subcategories.""" + text = (paper.get("title", "") + " " + paper.get("abstract", "")).lower() + + categories = { + "Layout Generation": ["layout", "composition", "arrange", "grid"], + "UI/UX Design": ["user 
interface", "ui", "ux", "interface design", "website"], + "Graphic Design": ["graphic design", "poster", "visual design", "typography"], + "Image Manipulation": ["image editing", "photo", "manipulation", "style transfer"], + "Design Tools": ["tool", "assistant", "workflow", "productivity"], + "3D Design": ["3d", "modeling", "cad", "product design"], + "Multimodal Design": ["multimodal", "text-to-image", "image-to-code"] + } + + matches = [] + for category, keywords in categories.items(): + if any(keyword in text for keyword in keywords): + matches.append(category) + + if matches: + return ", ".join(matches) + return "General Design Automation" + + def analyze_design_techniques(self, paper: Dict[str, Any]) -> List[str]: + """Extract AI/ML techniques used for design automation in the paper.""" + text = (paper.get("title", "") + " " + paper.get("abstract", "")).lower() + + techniques = [] + technique_keywords = { + "Generative Adversarial Networks": ["gan", "generative adversarial"], + "Diffusion Models": ["diffusion", "ddpm", "stable diffusion"], + "Transformers": ["transformer", "attention mechanism"], + "Reinforcement Learning": ["reinforcement learning", "rl"], + "Computer Vision": ["computer vision", "vision", "cnn"], + "Graph Neural Networks": ["graph neural", "gnn"], + "Large Language Models": ["llm", "large language model", "gpt", "chatgpt"], + "Neural Style Transfer": ["style transfer", "neural style"], + "Evolutionary Algorithms": ["genetic algorithm", "evolutionary"] + } + + for technique, keywords in technique_keywords.items(): + if any(keyword in text for keyword in keywords): + techniques.append(technique) + + return techniques + + def find_papers(self): + """Find design automation papers from arXiv.""" + self.log(f"Looking for design papers in the past {self.days_back} days") + self.log(f"Searching categories: {', '.join(self.categories)}") + + # Get papers for each category and date + dates = self.get_date_range() + all_papers = [] + + for category in 
self.categories: + for date_str in dates: + try: + papers = self.download_papers(category, date_str) + all_papers.extend(papers) + # Avoid hitting arXiv rate limits + time.sleep(3) + except Exception as e: + self.log(f"Error downloading papers for {category} on {date_str}: {e}") + + # Remove duplicates (papers can appear in multiple categories) + unique_papers = {} + for paper in all_papers: + paper_id = paper.get("main_page", "").split("/")[-1] + if paper_id and paper_id not in unique_papers: + unique_papers[paper_id] = paper + + all_papers = list(unique_papers.values()) + + # Filter for design automation papers + design_papers = [] + for paper in all_papers: + if self.is_design_automation_paper(paper): + paper["design_category"] = self.categorize_design_paper(paper) + paper["design_techniques"] = self.analyze_design_techniques(paper) + design_papers.append(paper) + + # Additional keyword filtering if specified + if self.keyword: + keyword = self.keyword.lower() + design_papers = [ + p for p in design_papers + if keyword in p.get("title", "").lower() or + keyword in p.get("abstract", "").lower() + ] + + # Sort by arXiv URL, highest ID first (IDs increase over time, so this approximates newest first) + design_papers.sort(key=lambda p: p.get("main_page", ""), reverse=True) + + self.papers = design_papers + self.log(f"Found {len(design_papers)} design automation papers") + return design_papers + + def print_paper_summary(self, paper: Dict[str, Any]): + """Print a nice summary of a paper to the console.""" + print(f"\n{'=' * 80}") + print(f"TITLE: {paper.get('title', 'No title')}") + print(f"AUTHORS: {paper.get('authors', 'No authors')}") + print(f"URL: {paper.get('main_page', 'No URL')}") + print(f"DESIGN CATEGORY: {paper.get('design_category', 'Unknown')}") + print(f"TECHNIQUES: {', '.join(paper.get('design_techniques', []))}") + print(f"\nABSTRACT: {paper.get('abstract', 'No abstract')[:500]}...") + print(f"{'=' * 80}\n") + + def generate_html_report(self): + """Generate an HTML report from papers.""" + if not self.papers: + self.log("No papers
to generate HTML report from") + return + + html = f""" + + + + + Design Automation Papers + + + +

Design Automation Papers

+ +
+

Found {len(self.papers)} papers related to graphic design automation with AI/ML

+

Generated on {datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")}

+

Keywords: {', '.join(DESIGN_AUTOMATION_KEYWORDS[:5])}...

+
+ """ + + # Count categories and techniques + categories = {} + techniques = {} + + for paper in self.papers: + category = paper.get("design_category", "Uncategorized") + if category in categories: + categories[category] += 1 + else: + categories[category] = 1 + + for technique in paper.get("design_techniques", []): + if technique in techniques: + techniques[technique] += 1 + else: + techniques[technique] = 1 + + # Add summary statistics + html += '

Summary Statistics

' + + html += "

Categories:

" + + html += "

Techniques:

" + + # Add papers + html += '

Papers

' + for paper in self.papers: + html += f""" +
+
{paper.get("title", "No title")}
+
{paper.get("authors", "Unknown authors")}
+
Category: {paper.get("design_category", "General")} | Subject: {paper.get("subjects", "N/A")}
+
Techniques: {', '.join(paper.get("design_techniques", ["None identified"]))}
+
Abstract: {paper.get("abstract", "No abstract available")}
+
+ PDF | + arXiv +
+
+ """ + + html += """ + + + + """ + + with open(self.html_file, "w") as f: + f.write(html) + + self.log(f"HTML report generated: {self.html_file}") + + def save_json(self): + """Save papers to JSON file.""" + if not self.papers: + self.log("No papers to save") + return + + with open(self.output_file, "w") as f: + json.dump(self.papers, f, indent=2) + + self.log(f"Saved {len(self.papers)} papers to {self.output_file}") + + def run(self): + """Run the full paper finding process.""" + self.find_papers() + + if not self.papers: + print("No design automation papers found.") + return + + # Print summary of top papers + for paper in self.papers[:10]: # Print top 10 + self.print_paper_summary(paper) + + if len(self.papers) > 10: + print(f"...and {len(self.papers) - 10} more papers.") + + # Save outputs + self.save_json() + self.generate_html_report() + + print(f"\nResults saved to {self.output_file} and {self.html_file}") + print(f"Open {self.html_file} in your browser to view the report.") + +def main(): + parser = argparse.ArgumentParser(description="Find the latest graphic design automation papers.") + parser.add_argument("--days", type=int, default=7, help="Number of days to look back") + parser.add_argument("--output", type=str, default="design_papers.json", help="Output file path") + parser.add_argument("--html", type=str, default="design_papers.html", help="HTML output file path") + parser.add_argument("--categories", type=str, nargs="+", default=DEFAULT_CATEGORIES, + help="arXiv categories to search") + parser.add_argument("--keyword", type=str, help="Additional keyword to filter papers") + parser.add_argument("--quiet", action="store_true", help="Suppress progress messages") + args = parser.parse_args() + + finder = DesignPaperFinder( + days_back=args.days, + categories=args.categories, + output_file=args.output, + html_file=args.html, + keyword=args.keyword, + verbose=not args.quiet + ) + + finder.run() + +if __name__ == "__main__": + main() \ No newline at end 
of file diff --git a/src/design/find_design_papers.py b/src/design/find_design_papers.py new file mode 100755 index 0000000..4ca25fe --- /dev/null +++ b/src/design/find_design_papers.py @@ -0,0 +1,624 @@ +#!/usr/bin/env python3 +""" +Standalone Design Papers Crawler - A simple script to find the latest papers +on graphic design automation using AI/ML/LLM technologies. + +This version has minimal dependencies and doesn't require the full model setup. + +Usage: + python find_design_papers.py [--days 7] [--output design_papers.json] +""" + +import os +import sys +import json +import argparse +import datetime +import logging +import re +import urllib.request +import time +from typing import List, Dict, Any, Optional, Tuple +from bs4 import BeautifulSoup as bs + +# Add parent directory to path to allow imports from sibling modules +sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) +from paths import DATA_DIR, DIGEST_DIR +from model_manager import model_manager, ModelProvider + +# Configure logging +logging.basicConfig( + level=logging.INFO, + format='%(asctime)s - %(name)s - %(levelname)s - %(message)s' +) +logger = logging.getLogger(__name__) + +# Default arXiv categories to search +DEFAULT_CATEGORIES = [ + "cs.CV", # Computer Vision + "cs.GR", # Graphics + "cs.HC", # Human-Computer Interaction + "cs.AI", # Artificial Intelligence + "cs.LG", # Machine Learning + "cs.CL", # Computation and Language (NLP) + "cs.MM" # Multimedia +] + +# Design automation keywords for paper filtering +DESIGN_AUTOMATION_KEYWORDS = [ + "design automation", "layout generation", "visual design", "graphic design", + "creative AI", "generative design", "UI generation", "UX automation", + "design system", "composition", "creative workflow", "automated design", + "design tool", "design assistant", "design optimization", "content-aware", + "user interface generation", "visual layout", "image composition" +] + +def download_papers(category: str, date_str: str = None) -> 
List[Dict[str, Any]]: + """ + Download papers for a specific category and date. + + Args: + category: arXiv category code + date_str: Date string in arXiv format (default: today) + + Returns: + List of paper dictionaries + """ + if not date_str: + date = datetime.datetime.now() + date_str = date.strftime("%a, %d %b %y") + + # Data directory is already created by paths.py + pass + + # Check if we already have this data + file_path = os.path.join(DATA_DIR, f"{category}_{date_str}.jsonl") + if os.path.exists(file_path): + papers = [] + with open(file_path, "r") as f: + for line in f: + papers.append(json.loads(line)) + return papers + + # Download new papers + logger.info(f"Downloading papers for {category} on {date_str}") + NEW_SUB_URL = f'https://arxiv.org/list/{category}/new' + + try: + # Add user-agent header to appear more like a browser + headers = { + 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36' + } + req = urllib.request.Request(NEW_SUB_URL, headers=headers) + page = urllib.request.urlopen(req) + except Exception as e: + logger.error(f"Error downloading from {NEW_SUB_URL}: {e}") + return [] + + soup = bs(page, 'html.parser') + content = soup.body.find("div", {'id': 'content'}) + + # Find the date heading + h3 = content.find("h3").text # e.g: New submissions for Wed, 10 May 23 + date_from_page = h3.replace("New submissions for", "").strip() + + # Find all papers + dt_list = content.dl.find_all("dt") + dd_list = content.dl.find_all("dd") + arxiv_base = "https://arxiv.org/abs/" + arxiv_html = "https://arxiv.org/html/" + + papers = [] + for i in range(len(dt_list)): + try: + paper = {} + ahref = dt_list[i].find('a', href=re.compile(r'[/]([a-z]|[A-Z])\w+')).attrs['href'] + paper_number = ahref.strip().replace("/abs/", "") + + paper['main_page'] = arxiv_base + paper_number + paper['pdf'] = arxiv_base.replace('abs', 'pdf') + paper_number + + paper['title'] = 
dd_list[i].find("div", {"class": "list-title mathjax"}).text.replace("Title:\n", "").strip() + paper['authors'] = dd_list[i].find("div", {"class": "list-authors"}).text.replace("Authors:\n", "").replace("\n", "").strip() + paper['subjects'] = dd_list[i].find("div", {"class": "list-subjects"}).text.replace("Subjects:\n", "").strip() + paper['abstract'] = dd_list[i].find("p", {"class": "mathjax"}).text.replace("\n", " ").strip() + + # Get a short excerpt of content (optional) + try: + # Add user-agent header to appear more like a browser + headers = { + 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36' + } + req = urllib.request.Request(arxiv_html + paper_number + "v1", headers=headers) + html = urllib.request.urlopen(req) + soup_content = bs(html, 'html.parser') + content_div = soup_content.find('div', attrs={'class': 'ltx_page_content'}) + if content_div: + para_list = content_div.find_all("div", attrs={'class': 'ltx_para'}) + excerpt = ' '.join([p.text.strip() for p in para_list[:3]]) # Get first 3 paragraphs + paper['content_excerpt'] = excerpt[:1000] + "..." if len(excerpt) > 1000 else excerpt + else: + paper['content_excerpt'] = "Content not available" + except Exception as e: + paper['content_excerpt'] = f"Error extracting content: {str(e)}" + + papers.append(paper) + except Exception as e: + logger.warning(f"Error processing paper {i}: {e}") + + # Save papers to file + with open(file_path, "w") as f: + for paper in papers: + f.write(json.dumps(paper) + "\n") + + return papers + +def is_design_automation_paper(paper: Dict[str, Any]) -> bool: + """ + Check if a paper is related to design automation based on keywords. 
+ + Args: + paper: Dictionary with paper details + + Returns: + Boolean indicating if paper is related to design automation + """ + text = ( + (paper.get("title", "") + " " + + paper.get("abstract", "") + " " + + paper.get("subjects", "")).lower() + ) + + return any(keyword.lower() in text for keyword in DESIGN_AUTOMATION_KEYWORDS) + +def categorize_design_paper(paper: Dict[str, Any]) -> str: + """ + Categorize design automation paper into subcategories. + + Args: + paper: Dictionary with paper details + + Returns: + Category name string + """ + text = (paper.get("title", "") + " " + paper.get("abstract", "")).lower() + + categories = { + "Layout Generation": ["layout", "composition", "arrange", "grid"], + "UI/UX Design": ["user interface", "ui", "ux", "interface design", "website"], + "Graphic Design": ["graphic design", "poster", "visual design", "typography"], + "Image Manipulation": ["image editing", "photo", "manipulation", "style transfer"], + "Design Tools": ["tool", "assistant", "workflow", "productivity"], + "3D Design": ["3d", "modeling", "cad", "product design"], + "Multimodal Design": ["multimodal", "text-to-image", "image-to-code"] + } + + matches = [] + for category, keywords in categories.items(): + if any(keyword in text for keyword in keywords): + matches.append(category) + + if matches: + return ", ".join(matches) + return "General Design Automation" + +def analyze_design_techniques(paper: Dict[str, Any]) -> List[str]: + """ + Extract AI/ML techniques used for design automation in the paper. 
+ + Args: + paper: Dictionary with paper details + + Returns: + List of techniques + """ + text = (paper.get("title", "") + " " + paper.get("abstract", "")).lower() + + techniques = [] + technique_keywords = { + "Generative Adversarial Networks": ["gan", "generative adversarial"], + "Diffusion Models": ["diffusion", "ddpm", "stable diffusion"], + "Transformers": ["transformer", "attention mechanism"], + "Reinforcement Learning": ["reinforcement learning", "rl"], + "Computer Vision": ["computer vision", "vision", "cnn"], + "Graph Neural Networks": ["graph neural", "gnn"], + "Large Language Models": ["llm", "large language model", "gpt"], + "Neural Style Transfer": ["style transfer", "neural style"], + "Evolutionary Algorithms": ["genetic algorithm", "evolutionary"] + } + + for technique, keywords in technique_keywords.items(): + if any(keyword in text for keyword in keywords): + techniques.append(technique) + + return techniques + +def get_date_range(days_back: int = 7) -> List[str]: + """ + Get a list of dates for the past N days in arXiv format. + + Args: + days_back: Number of days to look back + + Returns: + List of date strings in arXiv format + """ + today = datetime.datetime.now() + dates = [] + + for i in range(days_back): + date = today - datetime.timedelta(days=i) + date_str = date.strftime("%a, %d %b %y") + dates.append(date_str) + + return dates + +def generate_html_report(papers: List[Dict[str, Any]], output_file: str, keyword: str = None, days_back: int = 7) -> None: + """ + Generate an HTML report from papers. 
+ + Args: + papers: List of paper dictionaries + output_file: Path to output HTML file + keyword: Optional keyword used for filtering + days_back: Number of days searched + """ + # Ensure the output directory exists + output_dir = os.path.dirname(output_file) + if output_dir and not os.path.exists(output_dir): + os.makedirs(output_dir, exist_ok=True) + + # Create a title that includes any keywords and date + title_date = datetime.datetime.now().strftime("%B %d, %Y") + page_title = "Design Automation Papers" + if keyword: + page_title = f"Design Automation Papers - {keyword.title()} - {title_date}" + else: + page_title = f"Design Automation Papers - {title_date}" + + html = f""" + + + + + {page_title} + + + +

Design Automation Papers

+
+

Found {len(papers)} papers related to graphic design automation with AI/ML

+

Generated on {datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")}

+
+ """ + + # Count categories and techniques + categories = {} + techniques = {} + + for paper in papers: + category = paper.get("design_category", "Uncategorized") + if category in categories: + categories[category] += 1 + else: + categories[category] = 1 + + for technique in paper.get("design_techniques", []): + if technique in techniques: + techniques[technique] += 1 + else: + techniques[technique] = 1 + + # Add summary statistics + html += "

Summary Statistics

" + + html += "

Categories:

" + + html += "

Techniques:

" + + # Add papers + for paper in papers: + html += f""" +
+
{paper.get("title", "No title")}
+
{paper.get("authors", "Unknown authors")}
+
Category: {paper.get("design_category", "General")} | Subject: {paper.get("subjects", "N/A")}
+
Techniques: {', '.join(paper.get("design_techniques", ["None identified"]))}
+ """ + + # Add relevancy score and reasons if available + if "Relevancy score" in paper: + html += f'
Relevancy Score: {paper.get("Relevancy score", "N/A")}
' + + if "Reasons for match" in paper: + html += f'
Reason: {paper.get("Reasons for match", "")}
' + + # Add abstract + if "abstract" in paper: + html += f'
Abstract: {paper.get("abstract", "")}
' + + # Add all the additional analysis sections + for key, value in paper.items(): + if key in ["title", "authors", "subjects", "main_page", "Relevancy score", "Reasons for match", + "design_category", "design_techniques", "content", "abstract"]: + continue + + if isinstance(value, str) and value.strip(): + html += f'
{key}:
{value}
' + + # Add links + html += f""" + +
+ """ + + html += f""" + + + + """ + + with open(output_file, "w") as f: + f.write(html) + + logger.info(f"HTML report generated: {output_file}") + +def print_paper_summary(paper: Dict[str, Any]) -> None: + """ + Print a nice summary of a paper to the console. + + Args: + paper: Paper dictionary + """ + print(f"\n{'=' * 80}") + print(f"TITLE: {paper.get('title', 'No title')}") + print(f"AUTHORS: {paper.get('authors', 'No authors')}") + print(f"URL: {paper.get('main_page', 'No URL')}") + print(f"DESIGN CATEGORY: {paper.get('design_category', 'Unknown')}") + print(f"TECHNIQUES: {', '.join(paper.get('design_techniques', []))}") + print(f"\nABSTRACT: {paper.get('abstract', 'No abstract')[:500]}...") + print(f"{'=' * 80}\n") + +def analyze_papers_with_llm(papers: List[Dict[str, Any]], research_interest: str) -> List[Dict[str, Any]]: + """ + Analyze papers using LLM to provide detailed analysis + + Args: + papers: List of paper dictionaries + research_interest: Description of research interests + + Returns: + Enhanced list of papers with detailed analysis + """ + if not papers: + return papers + + # Check if model_manager is properly initialized + if not model_manager.is_provider_available(ModelProvider.OPENAI): + # Try to get OpenAI key from environment + import os + openai_key = os.environ.get("OPENAI_API_KEY") + if openai_key: + model_manager.register_openai(openai_key) + else: + logger.warning("No OpenAI API key available. Skipping detailed analysis.") + return papers + + logger.info(f"Analyzing {len(papers)} papers with LLM...") + + # Default research interest for design papers if none provided + if not research_interest: + research_interest = """ + I'm interested in papers that use AI/ML for design automation, including: + 1. Generative design systems for graphics, UI/UX, and layouts + 2. ML-enhanced creative tools and design assistants + 3. Novel techniques for automating design processes + 4. Human-AI collaborative design workflows + 5. 
Applications of LLMs, diffusion models, and GANs to design tasks + """ + + # Analyze papers using model_manager + try: + analyzed_papers, _ = model_manager.analyze_papers( + papers, + query={"interest": research_interest}, + providers=[ModelProvider.OPENAI], + model_names={ModelProvider.OPENAI: "gpt-3.5-turbo-16k"}, + threshold_score=0 # Include all papers, even low scored ones + ) + return analyzed_papers + except Exception as e: + logger.error(f"Error during LLM analysis: {e}") + return papers + +def pre_filter_category(category: str, keyword: str = None) -> bool: + """ + Check if a category is likely to contain design-related papers + to avoid downloading irrelevant categories. + + Args: + category: arXiv category code + keyword: Optional search keyword + + Returns: + Boolean indicating whether to include this category + """ + # Always include these categories as they're highly relevant + high_relevance = ["cs.GR", "cs.HC", "cs.CV", "cs.MM", "cs.SD"] + + if category in high_relevance: + return True + + # If we have a keyword, we need to be less strict to avoid missing papers + if keyword: + return True + + # Medium relevance categories - include for comprehensive searches + medium_relevance = ["cs.AI", "cs.LG", "cs.CL", "cs.RO", "cs.CY"] + return category in medium_relevance + +def main(): + parser = argparse.ArgumentParser(description="Find the latest graphic design automation papers.") + parser.add_argument("--days", type=int, default=7, help="Number of days to look back") + parser.add_argument("--output", type=str, help="Output JSON file path (date will be added automatically)") + parser.add_argument("--html", type=str, help="HTML output file path (date will be added automatically)") + parser.add_argument("--categories", type=str, nargs="+", default=DEFAULT_CATEGORIES, + help="arXiv categories to search") + parser.add_argument("--keyword", type=str, help="Additional keyword to filter papers") + parser.add_argument("--analyze", action="store_true", help="Use 
LLM to perform detailed analysis of papers") + parser.add_argument("--interest", type=str, help="Research interest description for LLM analysis") + parser.add_argument("--model", type=str, default="gpt-3.5-turbo-16k", help="Model to use for analysis") + parser.add_argument("--no-date", action="store_true", help="Disable adding date to filenames") + args = parser.parse_args() + + # Generate date string for filenames + current_date = datetime.datetime.now().strftime("%Y%m%d") + + # Set default filenames with dates if not provided + if args.output is None: + base_filename = "design_papers" + if args.keyword: + # Add keyword to filename if provided + base_filename = f"design_papers_{args.keyword.lower().replace(' ', '_')}" + + if not args.no_date: + args.output = os.path.join(DATA_DIR, f"{base_filename}_{current_date}.json") + else: + args.output = os.path.join(DATA_DIR, f"{base_filename}.json") + + if args.html is None: + base_filename = "design_papers" + if args.keyword: + # Add keyword to filename if provided + base_filename = f"design_papers_{args.keyword.lower().replace(' ', '_')}" + + if not args.no_date: + args.html = os.path.join(DIGEST_DIR, f"{base_filename}_{current_date}.html") + else: + args.html = os.path.join(DIGEST_DIR, f"{base_filename}.html") + + logger.info(f"Looking for design papers in the past {args.days} days") + + # Apply pre-filtering to categories + filtered_categories = [cat for cat in args.categories if pre_filter_category(cat, args.keyword)] + logger.info(f"Pre-filtered categories: {', '.join(filtered_categories)}") + + # Get papers for each category and date + dates = get_date_range(args.days) + all_papers = [] + + for category in filtered_categories: + for date_str in dates: + try: + papers = download_papers(category, date_str) + # Apply keyword filter immediately if provided + if args.keyword: + keyword = args.keyword.lower() + papers = [ + p for p in papers + if keyword in p.get("title", "").lower() or + keyword in p.get("abstract", 
"").lower() or + keyword in p.get("subjects", "").lower() + ] + logger.info(f"Found {len(papers)} papers matching keyword '{args.keyword}' in {category}") + + all_papers.extend(papers) + # Avoid hitting arXiv rate limits + time.sleep(5) + except Exception as e: + logger.error(f"Error downloading papers for {category} on {date_str}: {e}") + + # Remove duplicates (papers can appear in multiple categories) + unique_papers = {} + for paper in all_papers: + paper_id = paper.get("main_page", "").split("/")[-1] + if paper_id and paper_id not in unique_papers: + unique_papers[paper_id] = paper + + all_papers = list(unique_papers.values()) + + # Filter for design automation papers + design_papers = [] + for paper in all_papers: + if is_design_automation_paper(paper): + paper["design_category"] = categorize_design_paper(paper) + paper["design_techniques"] = analyze_design_techniques(paper) + design_papers.append(paper) + + # Sort by date + design_papers.sort(key=lambda p: p.get("main_page", ""), reverse=True) + logger.info(f"Found {len(design_papers)} design automation papers") + + # Add detailed analysis with LLM if requested + if args.analyze and design_papers: + design_papers = analyze_papers_with_llm(design_papers, args.interest) + logger.info("Completed LLM analysis of papers") + + # Debug: Print out the analysis fields for the first paper + if design_papers: + logger.info(f"Paper analysis fields: {list(design_papers[0].keys())}") + # If 'Key innovations' is present, it confirms we have the detailed analysis + if 'Key innovations' in design_papers[0]: + logger.info("Detailed analysis fields present!") + else: + logger.warning("Detailed analysis fields missing!") + + # Print summary to console + for paper in design_papers[:10]: # Print top 10 + print_paper_summary(paper) + + if len(design_papers) > 10: + print(f"...and {len(design_papers) - 10} more papers.") + + # Ensure output directory exists + output_dir = os.path.dirname(args.output) + if output_dir and not 
os.path.exists(output_dir): + os.makedirs(output_dir, exist_ok=True) + + # Save to file + with open(args.output, "w") as f: + json.dump(design_papers, f, indent=2) + + # Generate HTML report + generate_html_report(design_papers, args.html, args.keyword, args.days) + + logger.info(f"Saved {len(design_papers)} papers to {args.output}") + print(f"\nResults saved to {args.output} and {args.html}") + + if args.analyze: + print("\nPapers have been analyzed with LLM for detailed information.") + print("The HTML report includes comprehensive analysis of each paper.") + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/src/design/find_design_papers.sh b/src/design/find_design_papers.sh new file mode 100755 index 0000000..7712327 --- /dev/null +++ b/src/design/find_design_papers.sh @@ -0,0 +1,42 @@ +#!/bin/bash +# Design papers finder script +# Searches arXiv for design automation papers and generates reports +# For full documentation, see ./README.md + +# Add help/usage function +show_help() { + echo "Usage: ./find_design_papers.sh [OPTIONS]" + echo "" + echo "Options:" + echo " --days N Search papers from the last N days (default: 7)" + echo " --keyword TERM Filter papers containing this keyword" + echo " --analyze Use LLM to perform detailed analysis of papers" + echo " --interest \"TEXT\" Custom research interest description for LLM" + echo " --model MODEL Model to use for analysis (default: gpt-3.5-turbo-16k)" + echo " --no-date Don't add date to output filenames" + echo " --output FILE Custom JSON output path (default: data/design_papers_DATE.json)" + echo " --html FILE Custom HTML output path (default: digest/design_papers_DATE.html)" + echo " --help Show this help message" + echo "" + echo "Examples:" + echo " ./find_design_papers.sh" + echo " ./find_design_papers.sh --keyword \"layout\" --days 14" + echo " ./find_design_papers.sh --analyze --interest \"UI/UX automation\"" +} + +# Show help if requested +if [[ "$1" == "--help" || "$1" == 
"-h" ]]; then + show_help + exit 0 +fi + +# Run the design papers finder with all arguments passed through +python -m src.design.find_design_papers "$@" + +# Show success message +if [ $? -eq 0 ]; then + echo "βœ“ Design papers finder completed successfully!" + echo " Open the HTML report in your browser to view results" +else + echo "βœ— Design papers finder encountered an error" +fi diff --git a/src/design/get_design_papers.sh b/src/design/get_design_papers.sh new file mode 100755 index 0000000..af804ae --- /dev/null +++ b/src/design/get_design_papers.sh @@ -0,0 +1,66 @@ +#!/bin/bash +# Legacy wrapper script for design papers finder - maintained for backward compatibility +# For new scripts, use find_design_papers.sh instead + +# Show deprecation warning +echo "⚠️ Warning: get_design_papers.sh is deprecated and will be removed in a future version" +echo "⚠️ Please use find_design_papers.sh instead, which has more features and better output" +echo "" + +# Default values +DAYS=7 +OUTPUT="design_papers.json" +KEYWORD="" +ANALYZE="" + +# Parse command-line arguments +while [[ $# -gt 0 ]]; do + case $1 in + --days) + DAYS="$2" + shift 2 + ;; + --output) + OUTPUT="$2" + shift 2 + ;; + --keyword) + KEYWORD="$2" + shift 2 + ;; + --analyze) + ANALYZE="--analyze" + shift + ;; + --email) + # Ignore email parameter - email functionality is removed + echo "Note: Email functionality has been removed. HTML report will be generated locally only." + shift 2 + ;; + *) + echo "Unknown option: $1" + exit 1 + ;; + esac +done + +# Run the crawler using the new script +echo "Searching for design papers from the last $DAYS days..." 
+ +# Build the command +CMD="./src/design/find_design_papers.sh --days $DAYS --output ./data/$OUTPUT --html ./digest/${OUTPUT%.json}.html" + +# Add keyword if specified +if [ -n "$KEYWORD" ]; then + CMD="$CMD --keyword \"$KEYWORD\"" +fi + +# Add analyze if specified +if [ -n "$ANALYZE" ]; then + CMD="$CMD --analyze" +fi + +# Execute the command +eval "$CMD" + +echo "Done! View your results in ./digest/${OUTPUT%.json}.html" \ No newline at end of file diff --git a/src/design_automation.py b/src/design_automation.py new file mode 100644 index 0000000..c0eb51d --- /dev/null +++ b/src/design_automation.py @@ -0,0 +1,281 @@ +""" +Module for analyzing papers related to AI/ML for graphic design automation. +This module helps identify and analyze papers on automated design, layout generation, +creative AI tools, and related topics. +""" +import logging +import json +from typing import Dict, Any, List, Optional + +# Configure logging +logging.basicConfig(level=logging.INFO) +logger = logging.getLogger(__name__) + +# Design automation keywords for paper filtering +DESIGN_AUTOMATION_KEYWORDS = [ + "design automation", "layout generation", "visual design", "graphic design", + "creative AI", "generative design", "UI generation", "UX automation", + "design system", "composition", "creative workflow", "automated design", + "design tool", "design assistant", "design optimization", "content-aware", + "user interface generation", "visual layout", "image composition" +] + +DESIGN_AUTOMATION_PROMPT = """ +You are a specialized research assistant focused on AI/ML for graphic design automation. + +Analyze this paper from the perspective of AI for graphic design and creative automation: + +Title: {title} +Authors: {authors} +Abstract: {abstract} +Content: {content} + +Please provide a detailed analysis covering: + +1. Design automation focus: What aspect of design does this paper attempt to automate or enhance? +2. 
Technical approach: What AI/ML techniques are used in the paper for design automation? +3. Visual outputs: What kind of visual artifacts does the system generate? +4. Designer interaction: How does the system interact with human designers? +5. Data requirements: What data does the system use for training or operation? +6. Evaluation metrics: How is the system's design quality evaluated? +7. Real-world applicability: How practical is this approach for professional design workflows? +8. Novelty: What makes this approach unique compared to other design automation systems? +9. Limitations: What are the current limitations of this approach? +10. Future directions: What improvements or extensions are suggested? + +Format your response as JSON with these fields. +""" + +def is_design_automation_paper(paper: Dict[str, Any]) -> bool: + """ + Check if a paper is related to design automation based on keywords. + + Args: + paper: Dictionary with paper details + + Returns: + Boolean indicating if paper is related to design automation + """ + text = ( + (paper.get("title", "") + " " + + paper.get("abstract", "") + " " + + paper.get("subjects", "")).lower() + ) + + return any(keyword.lower() in text for keyword in DESIGN_AUTOMATION_KEYWORDS) + +def categorize_design_paper(paper: Dict[str, Any]) -> str: + """ + Categorize design automation paper into subcategories. 
+ + Args: + paper: Dictionary with paper details + + Returns: + Category name string + """ + text = (paper.get("title", "") + " " + paper.get("abstract", "")).lower() + + categories = { + "Layout Generation": ["layout", "composition", "arrange", "grid"], + "UI/UX Design": ["user interface", "ui", "ux", "interface design", "website"], + "Graphic Design": ["graphic design", "poster", "visual design", "typography"], + "Image Manipulation": ["image editing", "photo", "manipulation", "style transfer"], + "Design Tools": ["tool", "assistant", "workflow", "productivity"], + "3D Design": ["3d", "modeling", "cad", "product design"], + "Multimodal Design": ["multimodal", "text-to-image", "image-to-code"] + } + + matches = [] + for category, keywords in categories.items(): + if any(keyword.lower() in text for keyword in keywords): + matches.append(category) + + if matches: + return ", ".join(matches) + return "General Design Automation" + +def analyze_design_techniques(paper: Dict[str, Any]) -> List[str]: + """ + Extract AI/ML techniques used for design automation in the paper. 
+ + Args: + paper: Dictionary with paper details + + Returns: + List of techniques + """ + text = (paper.get("title", "") + " " + paper.get("abstract", "")).lower() + + techniques = [] + technique_keywords = { + "Generative Adversarial Networks": ["gan", "generative adversarial"], + "Diffusion Models": ["diffusion", "ddpm", "stable diffusion"], + "Transformers": ["transformer", "attention mechanism"], + "Reinforcement Learning": ["reinforcement learning", "rl"], + "Computer Vision": ["computer vision", "vision", "cnn"], + "Graph Neural Networks": ["graph neural", "gnn"], + "Large Language Models": ["llm", "large language model", "gpt"], + "Neural Style Transfer": ["style transfer", "neural style"], + "Evolutionary Algorithms": ["genetic algorithm", "evolutionary"] + } + + for technique, keywords in technique_keywords.items(): + if any(keyword in text for keyword in keywords): + techniques.append(technique) + + return techniques + +def extract_design_metrics(paper: Dict[str, Any]) -> List[str]: + """ + Extract evaluation metrics used for design quality assessment. 
+ + Args: + paper: Dictionary with paper details + + Returns: + List of metrics + """ + text = (paper.get("title", "") + " " + paper.get("abstract", "")).lower() + + metrics = [] + metric_keywords = { + "User Studies": ["user study", "user evaluation", "human evaluation"], + "Aesthetic Measures": ["aesthetic", "beauty", "visual quality"], + "Design Principles": ["design principle", "balance", "harmony", "contrast"], + "Technical Metrics": ["fid", "inception score", "clip score", "psnr"], + "Efficiency Metrics": ["time", "speed", "efficiency"], + "Usability": ["usability", "user experience", "ux", "ease of use"] + } + + for metric, keywords in metric_keywords.items(): + if any(keyword in text for keyword in keywords): + metrics.append(metric) + + return metrics + +def get_related_design_papers(paper_id: str, papers: List[Dict[str, Any]]) -> List[Dict[str, Any]]: + """ + Find papers related to a specific design automation paper. + + Args: + paper_id: ID of the target paper + papers: List of paper dictionaries + + Returns: + List of related papers + """ + target_paper = next((p for p in papers if p.get("main_page", "").endswith(paper_id)), None) + if not target_paper: + return [] + + # Get techniques used in target paper + target_techniques = analyze_design_techniques(target_paper) + target_category = categorize_design_paper(target_paper) + + related_papers = [] + for paper in papers: + if paper.get("main_page", "") == target_paper.get("main_page", ""): + continue + + # Check if paper is on design automation + if not is_design_automation_paper(paper): + continue + + # Check if techniques or categories overlap + paper_techniques = analyze_design_techniques(paper) + paper_category = categorize_design_paper(paper) + + technique_overlap = len(set(target_techniques) & set(paper_techniques)) + category_match = paper_category == target_category + + if technique_overlap > 0 or category_match: + paper["relevance_reason"] = [] + + if technique_overlap > 0: + 
paper["relevance_reason"].append(f"Uses similar techniques: {', '.join(set(target_techniques) & set(paper_techniques))}") + + if category_match: + paper["relevance_reason"].append(f"Same design category: {paper_category}") + + paper["relevance_score"] = (technique_overlap * 2) + (2 if category_match else 0) + related_papers.append(paper) + + # Sort by relevance score + related_papers.sort(key=lambda x: x.get("relevance_score", 0), reverse=True) + return related_papers[:5] # Return top 5 related papers + +def create_design_analysis_prompt(paper: Dict[str, Any]) -> str: + """ + Create a prompt for analyzing a design automation paper. + + Args: + paper: Dictionary with paper details + + Returns: + Formatted prompt string + """ + return DESIGN_AUTOMATION_PROMPT.format( + title=paper.get("title", ""), + authors=paper.get("authors", ""), + abstract=paper.get("abstract", ""), + content=paper.get("content", "")[:10000] # Limit content length + ) + +def extract_design_capabilities(analysis: Dict[str, Any]) -> Dict[str, Any]: + """ + Extract specific design capabilities from an analysis. 
+ + Args: + analysis: Dictionary with design paper analysis + + Returns: + Dictionary of design capabilities + """ + capabilities = {} + + # Extract design areas + if "Design automation focus" in analysis: + capabilities["design_areas"] = analysis["Design automation focus"] + + # Extract tools that could be replaced + tools = [] + tools_keywords = { + "Adobe Photoshop": ["photoshop", "photo editing", "image manipulation"], + "Adobe Illustrator": ["illustrator", "vector", "illustration"], + "Figma": ["figma", "ui design", "interface design"], + "Sketch": ["sketch", "ui design", "interface design"], + "InDesign": ["indesign", "layout", "publishing"], + "Canva": ["canva", "simple design", "templates"] + } + + for text_field in ["Technical approach", "Design automation focus", "Real-world applicability"]: + if text_field in analysis: + text = analysis[text_field].lower() + for tool, keywords in tools_keywords.items(): + if any(keyword in text for keyword in keywords): + tools.append(tool) + + capabilities["replaceable_tools"] = list(set(tools)) + + # Extract human-in-the-loop vs fully automated + if "Designer interaction" in analysis: + text = analysis["Designer interaction"].lower() + if "fully automated" in text or "automatic" in text or "without human" in text: + capabilities["automation_level"] = "Fully automated" + elif "human-in-the-loop" in text or "collaboration" in text or "assists" in text: + capabilities["automation_level"] = "Human-in-the-loop" + else: + capabilities["automation_level"] = "Hybrid" + + # Extract if it's ready for production + if "Real-world applicability" in analysis: + text = analysis["Real-world applicability"].lower() + if "production ready" in text or "commercially viable" in text or "can be used in real" in text: + capabilities["production_ready"] = True + elif "prototype" in text or "proof of concept" in text or "research" in text or "limitations" in text: + capabilities["production_ready"] = False + else: + 
capabilities["production_ready"] = "Unclear" + + return capabilities \ No newline at end of file diff --git a/src/design_finder/__init__.py b/src/design_finder/__init__.py new file mode 100644 index 0000000..d54a7b2 --- /dev/null +++ b/src/design_finder/__init__.py @@ -0,0 +1,3 @@ +""" +Design Finder module for finding AI/ML design automation papers on arXiv. +""" \ No newline at end of file diff --git a/src/design_finder/__main__.py b/src/design_finder/__main__.py new file mode 100644 index 0000000..4eaeee4 --- /dev/null +++ b/src/design_finder/__main__.py @@ -0,0 +1,7 @@ +""" +Entry point for design_finder module. +""" +from .main import main + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/src/design_finder/main.py b/src/design_finder/main.py new file mode 100644 index 0000000..ab6d7f6 --- /dev/null +++ b/src/design_finder/main.py @@ -0,0 +1,325 @@ +""" +Main module for design_finder. +Run with: python -m src.design_finder +""" +import os +import sys +import json +import argparse +import datetime +import logging +import time +from typing import List, Dict, Any + +# Add parent directory to path to import from sibling modules +current_dir = os.path.dirname(os.path.abspath(__file__)) +parent_dir = os.path.dirname(os.path.dirname(current_dir)) +if parent_dir not in sys.path: + sys.path.append(parent_dir) + +from src.download_new_papers import get_papers, _download_new_papers +from src.design_automation import ( + is_design_automation_paper, + categorize_design_paper, + analyze_design_techniques, + extract_design_metrics +) +from src.paths import DATA_DIR, DIGEST_DIR + +# Configure logging +logging.basicConfig( + level=logging.INFO, + format='%(asctime)s - %(name)s - %(levelname)s - %(message)s' +) +logger = logging.getLogger(__name__) + +# Default arXiv categories to search +DEFAULT_CATEGORIES = [ + "cs.CV", # Computer Vision + "cs.GR", # Graphics + "cs.HC", # Human-Computer Interaction + "cs.AI", # Artificial Intelligence + "cs.LG", # Machine 
Learning + "cs.CL", # Computation and Language (NLP) + "cs.MM", # Multimedia + "cs.SD", # Sound + "cs.RO", # Robotics (for interactive design) + "cs.CY" # Computers and Society +] + +def get_date_range(days_back: int = 7) -> List[str]: + """ + Get a list of dates for the past N days in arXiv format. + + Args: + days_back: Number of days to look back + + Returns: + List of date strings in arXiv format + """ + today = datetime.datetime.now() + dates = [] + + for i in range(days_back): + date = today - datetime.timedelta(days=i) + date_str = date.strftime("%a, %d %b %y") + dates.append(date_str) + + return dates + +def ensure_data_files(categories: List[str], days_back: int = 7) -> None: + """ + Make sure data files exist for the specified categories and date range. + + Args: + categories: List of arXiv category codes + days_back: Number of days to look back + """ + dates = get_date_range(days_back) + + for category in categories: + for date_str in dates: + # Add a delay between requests to avoid being blocked + time.sleep(1) + file_path = os.path.join(DATA_DIR, f"{category}_{date_str}.jsonl") + + if not os.path.exists(file_path): + logger.info(f"Downloading papers for {category} on {date_str}") + try: + _download_new_papers(category) + except Exception as e: + logger.error(f"Error downloading {category} papers for {date_str}: {e}") + +def get_design_papers(categories: List[str], days_back: int = 7) -> List[Dict[str, Any]]: + """ + Get design automation papers from specified categories over a date range. 
+ + Args: + categories: List of arXiv category codes + days_back: Number of days to look back + + Returns: + List of design automation papers + """ + # Ensure data files exist + ensure_data_files(categories, days_back) + + # Collect papers + all_papers = [] + dates = get_date_range(days_back) + + for category in categories: + for date_str in dates: + try: + papers = get_papers(category) + all_papers.extend(papers) + except Exception as e: + logger.warning(f"Could not get papers for {category} on {date_str}: {e}") + + # Remove duplicates (papers can appear in multiple categories) + unique_papers = {} + for paper in all_papers: + paper_id = paper.get("main_page", "").split("/")[-1] + if paper_id and paper_id not in unique_papers: + unique_papers[paper_id] = paper + + # Filter design automation papers + design_papers = [] + for paper_id, paper in unique_papers.items(): + if is_design_automation_paper(paper): + paper["paper_id"] = paper_id + paper["design_category"] = categorize_design_paper(paper) + paper["design_techniques"] = analyze_design_techniques(paper) + paper["design_metrics"] = extract_design_metrics(paper) + design_papers.append(paper) + + # Sort by date (newest first) + design_papers.sort(key=lambda p: p.get("main_page", ""), reverse=True) + + return design_papers + +def print_paper_summary(paper: Dict[str, Any]) -> None: + """ + Print a nice summary of a paper to the console. 
+ + Args: + paper: Paper dictionary + """ + print(f"\n{'=' * 80}") + print(f"TITLE: {paper.get('title', 'No title')}") + print(f"AUTHORS: {paper.get('authors', 'No authors')}") + print(f"URL: {paper.get('main_page', 'No URL')}") + print(f"DESIGN CATEGORY: {paper.get('design_category', 'Unknown')}") + print(f"TECHNIQUES: {', '.join(paper.get('design_techniques', []))}") + print(f"METRICS: {', '.join(paper.get('design_metrics', []))}") + print(f"\nABSTRACT: {paper.get('abstract', 'No abstract')[:500]}...") + print(f"{'=' * 80}\n") + +def generate_html_report(papers: List[Dict[str, Any]], output_file: str) -> None: + """ + Generate an HTML report from papers. + + Args: + papers: List of paper dictionaries + output_file: Path to output HTML file + """ + html = f""" + + + + + Design Automation Papers + + + +
Design Automation Papers
+
+ Found {len(papers)} papers related to graphic design automation with AI/ML
+ Generated on {datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")}
+
+ """ + + # Count categories and techniques + categories = {} + techniques = {} + + for paper in papers: + category = paper.get("design_category", "Uncategorized") + if category in categories: + categories[category] += 1 + else: + categories[category] = 1 + + for technique in paper.get("design_techniques", []): + if technique in techniques: + techniques[technique] += 1 + else: + techniques[technique] = 1 + + # Add summary statistics + html += "
Summary Statistics" + + html += "Categories:" + + html += "Techniques:
" + + # Add papers + for paper in papers: + publish_date = paper.get("main_page", "").split("/")[-1][:4] # Extract YYMM from id + + html += f""" +
+ {paper.get("title", "No title")}
+ {paper.get("authors", "Unknown authors")}
+ arXiv ID: {paper.get("paper_id", "Unknown")}
+ Category: {paper.get("design_category", "General")} | Subject: {paper.get("subjects", "N/A")}
+ Techniques: {', '.join(paper.get("design_techniques", ["None identified"]))}
+ Evaluation metrics: {', '.join(paper.get("design_metrics", ["None identified"]))}
+ Abstract: {paper.get("abstract", "No abstract available")}
+
+ """ + + html += """ + + + + """ + + with open(output_file, "w") as f: + f.write(html) + + logger.info(f"HTML report generated: {output_file}") + +def main(): + """Main function for the design finder module.""" + parser = argparse.ArgumentParser(description="Find the latest graphic design automation papers.") + parser.add_argument("--days", type=int, default=7, help="Number of days to look back") + parser.add_argument("--output", type=str, default="design_papers.json", help="Output JSON file path") + parser.add_argument("--html", type=str, default="design_papers.html", help="Output HTML file path") + parser.add_argument("--categories", type=str, nargs="+", default=DEFAULT_CATEGORIES, + help="arXiv categories to search") + parser.add_argument("--keyword", type=str, help="Additional keyword to filter papers") + parser.add_argument("--technique", type=str, help="Filter by specific technique") + parser.add_argument("--category", type=str, help="Filter by specific design category") + args = parser.parse_args() + + logger.info(f"Looking for design papers in the past {args.days} days") + logger.info(f"Searching categories: {', '.join(args.categories)}") + + # DATA_DIR is already created by paths.py + + # Get design papers + design_papers = get_design_papers(args.categories, args.days) + + # Apply additional filters if specified + if args.keyword: + keyword = args.keyword.lower() + design_papers = [ + p for p in design_papers + if keyword in p.get("title", "").lower() or + keyword in p.get("abstract", "").lower() + ] + logger.info(f"Filtered by keyword '{args.keyword}': {len(design_papers)} papers remaining") + + if args.technique: + technique = args.technique.lower() + design_papers = [ + p for p in design_papers + if any(technique in t.lower() for t in p.get("design_techniques", [])) + ] + logger.info(f"Filtered by technique '{args.technique}': {len(design_papers)} papers remaining") + + if args.category: + category = args.category.lower() + design_papers = [ + p for p 
in design_papers + if category in p.get("design_category", "").lower() + ] + logger.info(f"Filtered by category '{args.category}': {len(design_papers)} papers remaining") + + logger.info(f"Found {len(design_papers)} design automation papers") + + # Print summary to console + for paper in design_papers[:10]: # Print top 10 + print_paper_summary(paper) + + if len(design_papers) > 10: + print(f"...and {len(design_papers) - 10} more papers.") + + # Save to JSON file in data directory + output_path = os.path.join(DATA_DIR, args.output) + with open(output_path, "w") as f: + json.dump(design_papers, f, indent=2) + + logger.info(f"Saved {len(design_papers)} papers to {output_path}") + + # Generate HTML report in digest directory + html_path = os.path.join(DIGEST_DIR, args.html) + generate_html_report(design_papers, html_path) + + print(f"\nResults saved to {output_path} and {html_path}") + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/src/design_papers_crawler.py b/src/design_papers_crawler.py new file mode 100644 index 0000000..78641a2 --- /dev/null +++ b/src/design_papers_crawler.py @@ -0,0 +1,194 @@ +#!/usr/bin/env python3 +""" +Design Papers Crawler - A dedicated script to find the latest papers +on graphic design automation using AI/ML/LLM technologies. 
+ +Usage: + python design_papers_crawler.py [--days 7] [--output design_papers.json] +""" + +import os +import sys +import json +import argparse +import datetime +import logging +from typing import List, Dict, Any + +# Add parent directory to path to import from sibling modules +sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) + +from src.download_new_papers import get_papers, _download_new_papers +from src.design_automation import ( + is_design_automation_paper, + categorize_design_paper, + analyze_design_techniques, + extract_design_metrics +) +from src.paths import DATA_DIR, DIGEST_DIR + +# Configure logging +logging.basicConfig( + level=logging.INFO, + format='%(asctime)s - %(name)s - %(levelname)s - %(message)s' +) +logger = logging.getLogger(__name__) + +# Default arXiv categories to search +DEFAULT_CATEGORIES = [ + "cs.CV", # Computer Vision + "cs.GR", # Graphics + "cs.HC", # Human-Computer Interaction + "cs.AI", # Artificial Intelligence + "cs.LG", # Machine Learning + "cs.CL", # Computation and Language (NLP) + "cs.MM", # Multimedia + "cs.SD", # Sound + "cs.RO", # Robotics (for interactive design) + "cs.CY" # Computers and Society +] + +def get_date_range(days_back: int = 7) -> List[str]: + """ + Get a list of dates for the past N days in arXiv format. + + Args: + days_back: Number of days to look back + + Returns: + List of date strings in arXiv format + """ + today = datetime.datetime.now() + dates = [] + + for i in range(days_back): + date = today - datetime.timedelta(days=i) + date_str = date.strftime("%a, %d %b %y") + dates.append(date_str) + + return dates + +def ensure_data_files(categories: List[str], days_back: int = 7) -> None: + """ + Make sure data files exist for the specified categories and date range. 
+ + Args: + categories: List of arXiv category codes + days_back: Number of days to look back + """ + dates = get_date_range(days_back) + + for category in categories: + for date_str in dates: + file_path = os.path.join(DATA_DIR, f"{category}_{date_str}.jsonl") + + if not os.path.exists(file_path): + logger.info(f"Downloading papers for {category} on {date_str}") + try: + _download_new_papers(category) + except Exception as e: + logger.error(f"Error downloading {category} papers for {date_str}: {e}") + +def get_design_papers(categories: List[str], days_back: int = 7) -> List[Dict[str, Any]]: + """ + Get design automation papers from specified categories over a date range. + + Args: + categories: List of arXiv category codes + days_back: Number of days to look back + + Returns: + List of design automation papers + """ + # Ensure data files exist + ensure_data_files(categories, days_back) + + # Collect papers + all_papers = [] + dates = get_date_range(days_back) + + for category in categories: + for date_str in dates: + try: + papers = get_papers(category) + all_papers.extend(papers) + except Exception as e: + logger.warning(f"Could not get papers for {category} on {date_str}: {e}") + + # Remove duplicates (papers can appear in multiple categories) + unique_papers = {} + for paper in all_papers: + paper_id = paper.get("main_page", "").split("/")[-1] + if paper_id and paper_id not in unique_papers: + unique_papers[paper_id] = paper + + # Filter design automation papers + design_papers = [] + for paper_id, paper in unique_papers.items(): + if is_design_automation_paper(paper): + paper["paper_id"] = paper_id + paper["design_category"] = categorize_design_paper(paper) + paper["design_techniques"] = analyze_design_techniques(paper) + paper["design_metrics"] = extract_design_metrics(paper) + design_papers.append(paper) + + # Sort by date (newest first) + design_papers.sort(key=lambda p: p.get("main_page", ""), reverse=True) + + return design_papers + +def 
print_paper_summary(paper: Dict[str, Any]) -> None: + """ + Print a nice summary of a paper to the console. + + Args: + paper: Paper dictionary + """ + print(f"\n{'=' * 80}") + print(f"TITLE: {paper.get('title', 'No title')}") + print(f"AUTHORS: {paper.get('authors', 'No authors')}") + print(f"URL: {paper.get('main_page', 'No URL')}") + print(f"DESIGN CATEGORY: {paper.get('design_category', 'Unknown')}") + print(f"TECHNIQUES: {', '.join(paper.get('design_techniques', []))}") + print(f"METRICS: {', '.join(paper.get('design_metrics', []))}") + print(f"\nABSTRACT: {paper.get('abstract', 'No abstract')[:500]}...") + print(f"{'=' * 80}\n") + +def main(): + """Main function to run the design papers crawler.""" + parser = argparse.ArgumentParser(description="Find the latest graphic design automation papers.") + parser.add_argument("--days", type=int, default=7, help="Number of days to look back") + parser.add_argument("--output", type=str, default="design_papers.json", help="Output file path") + parser.add_argument("--categories", type=str, nargs="+", default=DEFAULT_CATEGORIES, + help="arXiv categories to search") + args = parser.parse_args() + + logger.info(f"Looking for design papers in the past {args.days} days") + logger.info(f"Searching categories: {', '.join(args.categories)}") + + # DATA_DIR is already created by paths.py + + # Get design papers + design_papers = get_design_papers(args.categories, args.days) + + logger.info(f"Found {len(design_papers)} design automation papers") + + # Print summary to console + for paper in design_papers[:10]: # Print top 10 + print_paper_summary(paper) + + if len(design_papers) > 10: + print(f"...and {len(design_papers) - 10} more papers.") + + # Determine output path - ensure it's in DATA_DIR + output_path = os.path.join(DATA_DIR, args.output) + + # Save to file + with open(output_path, "w") as f: + json.dump(design_papers, f, indent=2) + + logger.info(f"Saved {len(design_papers)} papers to {output_path}") + print(f"\nResults 
saved to {output_path}") + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/src/download_new_papers.py b/src/download_new_papers.py index b07b22c..56a0f3f 100644 --- a/src/download_new_papers.py +++ b/src/download_new_papers.py @@ -1,5 +1,8 @@ # encoding: utf-8 import os +import re +from urllib.error import HTTPError + import tqdm from bs4 import BeautifulSoup as bs import urllib.request @@ -7,10 +10,54 @@ import datetime import pytz +# Import standardized paths +from paths import DATA_DIR + +#Linh - add new def crawl_html_version(html_link) here +def crawl_html_version(html_link): + main_content = [] + try: + # Add user-agent header to appear more like a browser + headers = { + 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36' + } + req = urllib.request.Request(html_link, headers=headers) + html = urllib.request.urlopen(req) + except HTTPError as e: + return f"Error accessing HTML: {str(e)}" + + soup = bs(html) + content = soup.find('div', attrs={'class': 'ltx_page_content'}) + if not content: + return "Content not available in HTML format" + para_list = content.find_all("div", attrs={'class': 'ltx_para'}) + for each in para_list: + main_content.append(each.text.strip()) + return ' '.join(main_content)[:10000] + #if len(main_content >) + #return ''.join(main_content) if len(main_content) < 20000 else ''.join(main_content[:20000]) + +#Linh - add because cs sub does not have abstract displayed, will revert if it comes back +def crawl_abstract(html_link): + main_content = [] + try: + html = urllib.request.urlopen(html_link) + except HTTPError as e: + return ["None"] + soup = bs(html) + content = soup.find('blockquote', attrs={'class': 'abstract'}).text.replace("Abstract:", "").strip() + return content def _download_new_papers(field_abbr): NEW_SUB_URL = f'https://arxiv.org/list/{field_abbr}/new' # https://arxiv.org/list/cs/new - page = 
urllib.request.urlopen(NEW_SUB_URL) + print(NEW_SUB_URL) + # Add user-agent header to appear more like a browser + headers = { + 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36' + } + req = urllib.request.Request(NEW_SUB_URL, headers=headers) + page = urllib.request.urlopen(req) + soup = bs(page) content = soup.body.find("div", {'id': 'content'}) @@ -21,31 +68,40 @@ def _download_new_papers(field_abbr): dt_list = content.dl.find_all("dt") dd_list = content.dl.find_all("dd") arxiv_base = "https://arxiv.org/abs/" + arxiv_html = "https://arxiv.org/html/" assert len(dt_list) == len(dd_list) new_paper_list = [] for i in tqdm.tqdm(range(len(dt_list))): paper = {} - paper_number = dt_list[i].text.strip().split(" ")[2].split(":")[-1] + ahref = dt_list[i].find('a', href = re.compile(r'[/]([a-z]|[A-Z])\w+')).attrs['href'] + paper_number = ahref.strip().replace("/abs/", "") + paper['main_page'] = arxiv_base + paper_number paper['pdf'] = arxiv_base.replace('abs', 'pdf') + paper_number - paper['title'] = dd_list[i].find("div", {"class": "list-title mathjax"}).text.replace("Title: ", "").strip() + paper['title'] = dd_list[i].find("div", {"class": "list-title mathjax"}).text.replace("Title:\n", "").strip() paper['authors'] = dd_list[i].find("div", {"class": "list-authors"}).text \ .replace("Authors:\n", "").replace("\n", "").strip() - paper['subjects'] = dd_list[i].find("div", {"class": "list-subjects"}).text.replace("Subjects: ", "").strip() + paper['subjects'] = dd_list[i].find("div", {"class": "list-subjects"}).text.replace("Subjects:\n", "").strip() + #print(dd_list[i].find("div", {"class": "list-subjects"}).text.replace("Subjects:\n", "").strip()) + + #TODO: edit the abstract part - it is currently moved paper['abstract'] = dd_list[i].find("p", {"class": "mathjax"}).text.replace("\n", " ").strip() + try: + paper['content'] = crawl_html_version(arxiv_html + paper_number + "v1") + except 
Exception as e: + paper['content'] = f"Error fetching content: {str(e)}" new_paper_list.append(paper) - # check if ./data exist, if not, create it - if not os.path.exists("./data"): - os.makedirs("./data") + # DATA_DIR is already created by paths.py # save new_paper_list to a jsonl file, with each line as the element of a dictionary date = datetime.date.fromtimestamp(datetime.datetime.now(tz=pytz.timezone("America/New_York")).timestamp()) date = date.strftime("%a, %d %b %y") - with open(f"./data/{field_abbr}_{date}.jsonl", "w") as f: + file_path = os.path.join(DATA_DIR, f"{field_abbr}_{date}.jsonl") + with open(file_path, "w") as f: for paper in new_paper_list: f.write(json.dumps(paper) + "\n") @@ -53,12 +109,15 @@ def _download_new_papers(field_abbr): def get_papers(field_abbr, limit=None): date = datetime.date.fromtimestamp(datetime.datetime.now(tz=pytz.timezone("America/New_York")).timestamp()) date = date.strftime("%a, %d %b %y") - if not os.path.exists(f"./data/{field_abbr}_{date}.jsonl"): + file_path = os.path.join(DATA_DIR, f"{field_abbr}_{date}.jsonl") + if not os.path.exists(file_path): _download_new_papers(field_abbr) results = [] - with open(f"./data/{field_abbr}_{date}.jsonl", "r") as f: + with open(file_path, "r") as f: for i, line in enumerate(f.readlines()): if limit and i == limit: return results results.append(json.loads(line)) return results + +#crawl_html_version("https://arxiv.org/html/2404.11972v1") diff --git a/src/fix_parser.py b/src/fix_parser.py new file mode 100644 index 0000000..c003f5e --- /dev/null +++ b/src/fix_parser.py @@ -0,0 +1,91 @@ +""" +A script to fix and test the OpenAI response parsing. 
+""" +import json +import re +import os + +def is_valid_json(text): + try: + json.loads(text) + return True + except json.JSONDecodeError: + return False + +def extract_json_from_string(text): + """ + Attempt to extract JSON from a string by finding '{'...'}' + """ + # Find the outermost JSON object + stack = [] + start_idx = -1 + + for i, char in enumerate(text): + if char == '{' and start_idx == -1: + start_idx = i + stack.append(char) + elif char == '{': + stack.append(char) + elif char == '}' and stack: + stack.pop() + if not stack and start_idx != -1: + # Found complete JSON object + json_str = text[start_idx:i+1] + try: + parsed = json.loads(json_str) + return parsed + except json.JSONDecodeError: + # If this one fails, continue looking + start_idx = -1 + + return None + +def fix_openai_response(response_text): + """ + Fix the OpenAI response by handling different formats and parsing the JSON. + Returns a list of dictionaries with paper analysis. + """ + # First, try to parse the entire response as JSON + cleaned_text = response_text.strip() + + # Try to extract JSON directly + if '{' in cleaned_text and '}' in cleaned_text: + json_obj = extract_json_from_string(cleaned_text) + if json_obj and "Relevancy score" in json_obj: + print(f"Successfully extracted JSON with score {json_obj['Relevancy score']}") + return [json_obj] + + return [] + +# Example usage +if __name__ == "__main__": + example_response = """ + "Relevancy score": 7, + "Reasons for match": "This paper aligns with your research interests as it explores the application of Large Language Models (LLMs) in the context of hardware design. It introduces a unified framework, Marco, that integrates configurable graph-based task solving with multi-modality and multi-AI agents for chip design. 
This is relevant to your interests in AI Alignment, AI safety, Large Language Models, and Multimodal Learning.", + "Key innovations": [ + "Introduction of Marco, a unified framework that integrates configurable graph-based task solving with multi-modality and multi-AI agents for chip design.", + "Demonstration of promising performance, productivity, and efficiency of LLM agents by leveraging the Marco framework on layout optimization, Verilog/design rule checker (DRC) coding, and timing analysis tasks." + ], + "Critical analysis": "The paper presents a novel approach to leveraging LLMs in the field of hardware design, which could have significant implications for improving efficiency and reducing costs. However, without access to the full paper, it's difficult to assess the strengths and potential limitations of the approach.", + "Goal": "The paper addresses the challenge of optimizing performance, power, area, and cost (PPAC) during synthesis, verification, physical design, and reliability loops in hardware design. It aims to reduce turn-around-time (TAT) for these processes by leveraging the capabilities of LLMs.", + "Data": "Unable to provide details about the datasets used due to lack of access to the full paper content.", + "Methodology": "The paper proposes a unified framework, Marco, that integrates configurable graph-based task solving with multi-modality and multi-AI agents for chip design. However, detailed methodology is not available due to lack of access to the full paper content.", + "Implementation details": "Unable to provide implementation details due to lack of access to the full paper content.", + "Git": "Link to code repository is not provided in the abstract.", + "Experiments & Results": "The abstract mentions that the Marco framework demonstrates promising performance on layout optimization, Verilog/design rule checker (DRC) coding, and timing analysis tasks. 
However, detailed results and comparisons are not available due to lack of access to the full paper content.", + "Discussion & Next steps": "Unable to provide details on the authors' conclusions, identified limitations, and future research directions due to lack of access to the full paper content.", + "Related work": "Unable to provide details on how this paper relates to similar recent papers in the field due to lack of access to the full paper content.", + "Practical applications": "The framework proposed in this paper could have practical applications in the field of hardware design, potentially leading to faster product cycles, lower costs, improved design reliability and reduced risk of costly errors.", + "Key takeaways": [ + "The paper proposes a unified framework, Marco, that integrates configurable graph-based task solving with multi-modality and multi-AI agents for chip design.", + "The Marco framework leverages the capabilities of Large Language Models (LLMs) to improve efficiency and reduce costs in hardware design.", + "The framework demonstrates promising performance on layout optimization, Verilog/design rule checker (DRC) coding, and timing analysis tasks." + ] +} + """ + + # Test the fix + results = fix_openai_response(example_response) + print(f"Found {len(results)} paper analyses") + for i, result in enumerate(results): + print(f"Paper {i+1} score: {result.get('Relevancy score', 'Not found')}") \ No newline at end of file diff --git a/src/gemini_utils.py b/src/gemini_utils.py new file mode 100644 index 0000000..b5634db --- /dev/null +++ b/src/gemini_utils.py @@ -0,0 +1,269 @@ +""" +Gemini API integration for ArxivDigest. +This module provides functions to work with Google's Gemini API for paper analysis. 
+""" +import os +import json +import logging +import time +from typing import List, Dict, Any, Optional + +try: + import google.generativeai as genai + from google.api_core.exceptions import GoogleAPIError + GEMINI_AVAILABLE = True +except ImportError: + GEMINI_AVAILABLE = False + +# Configure logging +logging.basicConfig(level=logging.INFO) +logger = logging.getLogger(__name__) + +class GeminiConfig: + """Configuration for Gemini API calls.""" + def __init__( + self, + temperature: float = 0.4, + max_output_tokens: int = 2048, + top_p: float = 0.95, + top_k: int = 40 + ): + self.temperature = temperature + self.max_output_tokens = max_output_tokens + self.top_p = top_p + self.top_k = top_k + +def setup_gemini_api(api_key: str) -> bool: + """ + Setup the Gemini API with the provided API key. + + Args: + api_key: Gemini API key + + Returns: + bool: True if setup was successful, False otherwise + """ + if not GEMINI_AVAILABLE: + logger.error("Gemini package not installed. Run 'pip install google-generativeai'") + return False + + if not api_key: + logger.error("No Gemini API key provided") + return False + + try: + genai.configure(api_key=api_key) + # Test API connection + models = genai.list_models() + logger.info(f"Successfully connected to Gemini API. Available models: {[m.name for m in models if 'generateContent' in m.supported_generation_methods]}") + return True + except Exception as e: + logger.error(f"Failed to setup Gemini API: {e}") + return False + +def get_gemini_model(model_name: str = "gemini-1.5-flash"): + """ + Get a Gemini model by name. 
+ + Args: + model_name: Name of the Gemini model + + Returns: + Model object or None if not available + """ + if not GEMINI_AVAILABLE: + return None + + try: + model = genai.GenerativeModel(model_name) + return model + except Exception as e: + logger.error(f"Failed to get Gemini model: {e}") + return None + +def analyze_papers_with_gemini( + papers: List[Dict[str, Any]], + query: Dict[str, str], + config: Optional[GeminiConfig] = None, + model_name: str = "gemini-1.5-flash" +) -> List[Dict[str, Any]]: + """ + Analyze papers using the Gemini model. + + Args: + papers: List of paper dictionaries + query: Dictionary with 'interest' key describing research interests + config: GeminiConfig object + model_name: Name of the Gemini model to use + + Returns: + List of papers with added analysis + """ + if not GEMINI_AVAILABLE: + logger.error("Gemini package not installed. Cannot analyze papers.") + return papers + + if not config: + config = GeminiConfig() + + model = get_gemini_model(model_name) + if not model: + return papers + + analyzed_papers = [] + + for paper in papers: + try: + # Prepare prompt + prompt = f""" + You are a research assistant analyzing academic papers in AI and ML. + + Analyze this paper and provide insights based on the user's research interests. 
+ + Research interests: {query['interest']} + + Paper details: + Title: {paper['title']} + Authors: {paper['authors']} + Abstract: {paper['abstract']} + Content: {paper['content'][:5000]} + + Please provide your response as a single JSON object with the following structure: + {{ + "Relevancy score": 1-10 (higher = more relevant), + "Reasons for match": "Detailed explanation of why this paper matches the interests", + "Key innovations": "List the main contributions of the paper", + "Critical analysis": "Evaluate strengths and weaknesses", + "Goal": "What problem does the paper address?", + "Data": "Description of datasets used", + "Methodology": "Technical approach and methods", + "Implementation details": "Model architecture, hyperparameters, etc.", + "Experiments & Results": "Key findings and comparisons", + "Discussion & Next steps": "Limitations and future work", + "Related work": "Connection to similar research", + "Practical applications": "Real-world uses of this research", + "Key takeaways": ["Point 1", "Point 2", "Point 3"] + }} + + Format your response as a valid JSON object and nothing else. 
+ """ + + # Just log that we're sending a prompt to Gemini + print(f"Sending prompt to Gemini for paper: {paper['title'][:50]}...") + + generation_config = { + "temperature": config.temperature, + "top_p": config.top_p, + "top_k": config.top_k, + "max_output_tokens": config.max_output_tokens, + } + + response = model.generate_content( + prompt, + generation_config=generation_config + ) + + # Extract and parse the response + response_text = response.text + + # Try to extract JSON + try: + start_idx = response_text.find('{') + end_idx = response_text.rfind('}') + 1 + if start_idx >= 0 and end_idx > start_idx: + json_str = response_text[start_idx:end_idx] + gemini_analysis = json.loads(json_str) + + # Add Gemini analysis to paper + paper['gemini_analysis'] = gemini_analysis + + # Directly copy fields to paper + for key, value in gemini_analysis.items(): + paper[key] = value + else: + logger.warning(f"Could not extract JSON from Gemini response for paper {paper['title']}") + paper['gemini_analysis'] = {"error": "Failed to parse response"} + except json.JSONDecodeError: + logger.warning(f"Failed to parse Gemini response as JSON for paper {paper['title']}") + paper['gemini_analysis'] = {"error": "Failed to parse response"} + + analyzed_papers.append(paper) + + # Avoid rate limiting + time.sleep(1) + + except GoogleAPIError as e: + logger.error(f"Gemini API error: {e}") + paper['gemini_analysis'] = {"error": f"Gemini API error: {str(e)}"} + analyzed_papers.append(paper) + + except Exception as e: + logger.error(f"Error analyzing paper with Gemini: {e}") + paper['gemini_analysis'] = {"error": f"Error: {str(e)}"} + analyzed_papers.append(paper) + + return analyzed_papers + +def get_topic_clustering(papers: List[Dict[str, Any]], model_name: str = "gemini-1.5-flash"): + """ + Cluster papers by topic using Gemini. 
+ + Args: + papers: List of paper dictionaries + model_name: Name of the Gemini model to use + + Returns: + Dictionary with topic clusters + """ + if not GEMINI_AVAILABLE: + logger.error("Gemini package not installed. Cannot cluster papers.") + return {} + + model = get_gemini_model(model_name) + if not model: + return {} + + # Create a condensed representation of the papers + paper_summaries = [] + for i, paper in enumerate(papers): + paper_summaries.append(f"{i+1}. Title: {paper['title']}\nAbstract: {paper['abstract'][:300]}...") + + paper_text = "\n\n".join(paper_summaries) + + prompt = f""" + You are a research librarian organizing academic papers into topic clusters. + + Analyze these papers and group them into 3-7 thematic clusters: + + {paper_text} + + For each cluster: + 1. Provide a descriptive name for the cluster + 2. List the paper numbers that belong to this cluster + 3. Explain why these papers belong together + + Format your response as JSON with these fields: "clusters" (an array of objects with "name", "papers", and "description" fields). 
+ """ + + try: + response = model.generate_content(prompt) + response_text = response.text + + # Try to extract JSON + try: + start_idx = response_text.find('{') + end_idx = response_text.rfind('}') + 1 + if start_idx >= 0 and end_idx > start_idx: + json_str = response_text[start_idx:end_idx] + cluster_data = json.loads(json_str) + return cluster_data + else: + logger.warning("Could not extract JSON from Gemini clustering response") + return {"error": "Failed to parse clustering response"} + except json.JSONDecodeError: + logger.warning("Failed to parse Gemini clustering response as JSON") + return {"error": "Failed to parse clustering response"} + + except Exception as e: + logger.error(f"Error clustering papers with Gemini: {e}") + return {"error": f"Clustering error: {str(e)}"} \ No newline at end of file diff --git a/src/interpretability_analysis.py b/src/interpretability_analysis.py new file mode 100644 index 0000000..b503567 --- /dev/null +++ b/src/interpretability_analysis.py @@ -0,0 +1,232 @@ +""" +Specialized module for mechanistic interpretability and technical AI safety analysis. +""" +import json +import logging +from typing import Dict, Any, List, Optional + +# Configure logging +logging.basicConfig(level=logging.INFO) +logger = logging.getLogger(__name__) + +# Prompts for specialized analysis +MECHANISTIC_INTERPRETABILITY_PROMPT = """ +You are a research assistant specializing in mechanistic interpretability of AI systems. + +Analyze this paper from the perspective of mechanistic interpretability: + +Title: {title} +Authors: {authors} +Abstract: {abstract} +Content: {content} + +Please provide a detailed analysis covering: + +1. Relevance to mechanistic interpretability: How does this paper contribute to understanding the internal workings of models? +2. Interpretability techniques: What specific methods or approaches does the paper use to explain model behavior? +3. 
Circuit analysis: Does the paper identify specific circuits or computational components within models? +4. Attribution methods: What techniques are used to attribute model outputs to internal components? +5. Novel insights: What new understanding does this paper bring to model internals? +6. Limitations: What are the limitations of the approach from an interpretability perspective? +7. Future directions: What follow-up work would be valuable? +8. Connections to other interpretability research: How does this relate to other work in the field? + +Format your response as JSON with these fields. +""" + +TECHNICAL_AI_SAFETY_PROMPT = """ +You are a research assistant specializing in technical AI safety. + +Analyze this paper from the perspective of technical AI safety: + +Title: {title} +Authors: {authors} +Abstract: {abstract} +Content: {content} + +Please provide a detailed analysis covering: + +1. Relevance to AI safety: How does this paper contribute to building safer AI systems? +2. Safety approaches: What specific methods or approaches does the paper use to improve AI safety? +3. Robustness: How does the paper address model robustness to distribution shifts or adversarial attacks? +4. Alignment: Does the paper discuss techniques for aligning AI systems with human values? +5. Risk assessment: What potential risks or failure modes does the paper address? +6. Monitoring and oversight: What methods are proposed for monitoring or controlling AI systems? +7. Limitations: What are the limitations of the approach from a safety perspective? +8. Future directions: What follow-up work would be valuable for improving safety? + +Format your response as JSON with these fields. +""" + +PROMPT_TEMPLATES = { + "mechanistic_interpretability": MECHANISTIC_INTERPRETABILITY_PROMPT, + "technical_ai_safety": TECHNICAL_AI_SAFETY_PROMPT +} + +def extract_json_from_text(text: str) -> Dict[str, Any]: + """ + Attempt to extract JSON from text, handling various formats. 
+ + Args: + text: String potentially containing JSON + + Returns: + Extracted JSON as a dictionary, or error dictionary + """ + try: + # Look for JSON-like structures + start_idx = text.find('{') + end_idx = text.rfind('}') + 1 + + if start_idx >= 0 and end_idx > start_idx: + json_str = text[start_idx:end_idx] + return json.loads(json_str) + else: + return {"error": "Could not find JSON in text", "raw_text": text} + except json.JSONDecodeError: + return {"error": "Failed to parse as JSON", "raw_text": text} + +def create_analysis_prompt(paper: Dict[str, Any], analysis_type: str) -> str: + """ + Create a prompt for specialized analysis. + + Args: + paper: Dictionary with paper details + analysis_type: Type of analysis to perform + + Returns: + Formatted prompt string + """ + if analysis_type not in PROMPT_TEMPLATES: + raise ValueError(f"Unknown analysis type: {analysis_type}") + + prompt_template = PROMPT_TEMPLATES[analysis_type] + + return prompt_template.format( + title=paper.get("title", ""), + authors=paper.get("authors", ""), + abstract=paper.get("abstract", ""), + content=paper.get("content", "")[:10000] # Limit content length + ) + +def analyze_interpretability_circuits(paper: Dict[str, Any], response: Dict[str, Any]) -> Dict[str, Any]: + """ + Perform additional circuit analysis based on paper content and initial response. 
+ + Args: + paper: Dictionary with paper details + response: Initial analysis response + + Returns: + Enhanced analysis with circuit information + """ + # This is a placeholder for more sophisticated circuit analysis + # In a real implementation, this would use specialized tools to analyze + # neural network circuits mentioned in the paper + + # Extract potential circuit descriptions from paper content + circuit_mentions = [] + + content = paper.get("content", "").lower() + circuit_keywords = ["circuit", "attention head", "neuron", "mlp", "weight", "activation"] + + for keyword in circuit_keywords: + if keyword in content: + # Very simple extraction - in reality would use more sophisticated NLP + start_idx = content.find(keyword) + if start_idx >= 0: + excerpt = content[max(0, start_idx-50):min(len(content), start_idx+100)] + circuit_mentions.append(excerpt) + + # Add circuit information to response + enhanced_response = response.copy() + enhanced_response["circuit_mentions"] = circuit_mentions[:5] # Limit to 5 mentions + enhanced_response["circuit_analysis_performed"] = len(circuit_mentions) > 0 + + return enhanced_response + +def get_paper_relation_to_ai_safety(paper: Dict[str, Any]) -> str: + """ + Determine how a paper relates to AI safety research. 
+ + Args: + paper: Dictionary with paper details + + Returns: + Description of relation to AI safety + """ + # Simple keyword-based approach + safety_keywords = { + "alignment": "AI alignment", + "safety": "AI safety", + "robustness": "Model robustness", + "adversarial": "Adversarial robustness", + "bias": "Bias mitigation", + "fairness": "Fairness", + "transparency": "Transparency", + "interpretability": "Interpretability", + "explainability": "Explainability", + "oversight": "AI oversight", + "control": "AI control", + "verification": "Formal verification", + "monitoring": "AI monitoring" + } + + relation = [] + content = (paper.get("abstract", "") + " " + paper.get("title", "")).lower() + + for keyword, category in safety_keywords.items(): + if keyword in content: + relation.append(category) + + if relation: + return ", ".join(set(relation)) + else: + return "No direct relation to AI safety identified" + +def analyze_multi_agent_safety(paper: Dict[str, Any]) -> Dict[str, Any]: + """ + Analyze multi-agent safety aspects of a paper. 
+ + Args: + paper: Dictionary with paper details + + Returns: + Multi-agent safety analysis + """ + # Check if paper mentions multi-agent systems + content = (paper.get("abstract", "") + " " + paper.get("title", "")).lower() + + multi_agent_keywords = [ + "multi-agent", "multiagent", "agent cooperation", "agent competition", + "game theory", "nash equilibrium", "cooperative ai", "collaborative ai" + ] + + is_multi_agent = any(keyword in content for keyword in multi_agent_keywords) + + if not is_multi_agent: + return {"is_multi_agent_focused": False} + + # Simple analysis of multi-agent safety aspects + safety_aspects = [] + + if "cooperation" in content or "collaborative" in content or "coordination" in content: + safety_aspects.append("Agent cooperation") + + if "competition" in content or "adversarial" in content: + safety_aspects.append("Agent competition") + + if "equilibrium" in content or "game theory" in content: + safety_aspects.append("Game theoretic analysis") + + if "incentive" in content or "reward" in content: + safety_aspects.append("Incentive design") + + if "communication" in content: + safety_aspects.append("Agent communication") + + return { + "is_multi_agent_focused": True, + "multi_agent_safety_aspects": safety_aspects, + "summary": f"This paper focuses on multi-agent systems, specifically addressing: {', '.join(safety_aspects)}" if safety_aspects else "This paper discusses multi-agent systems but doesn't specifically address safety aspects." + } \ No newline at end of file diff --git a/src/model_manager.py b/src/model_manager.py new file mode 100644 index 0000000..9f60d3e --- /dev/null +++ b/src/model_manager.py @@ -0,0 +1,435 @@ +""" +Model Manager module to handle different LLM providers. +This provides a unified interface for working with different LLM providers. 
+""" +import os +import json +import logging +import time +from typing import Dict, List, Any, Optional, Union, Tuple +from enum import Enum + +import openai +try: + import google.generativeai as genai + GEMINI_AVAILABLE = True +except ImportError: + GEMINI_AVAILABLE = False + +try: + import anthropic + ANTHROPIC_AVAILABLE = True +except ImportError: + ANTHROPIC_AVAILABLE = False + +# Configure logging +logging.basicConfig(level=logging.INFO) +logger = logging.getLogger(__name__) + +class ModelProvider(Enum): + OPENAI = "openai" + GEMINI = "gemini" + ANTHROPIC = "anthropic" + +class ModelManager: + """Manager for handling different LLM providers.""" + + def __init__(self): + self.providers = {} + self.available_models = {} + + def register_openai(self, api_key: str) -> bool: + """Register OpenAI as a provider.""" + if not api_key: + logger.error("No OpenAI API key provided") + return False + + try: + openai.api_key = api_key + # Test API connection + models = openai.Model.list() + self.providers[ModelProvider.OPENAI] = True + self.available_models[ModelProvider.OPENAI] = [model.id for model in models.data] + logger.info(f"Successfully connected to OpenAI API. Available models: {self.available_models[ModelProvider.OPENAI]}") + return True + except Exception as e: + logger.error(f"Failed to setup OpenAI API: {e}") + return False + + def register_gemini(self, api_key: str) -> bool: + """Register Gemini as a provider.""" + if not GEMINI_AVAILABLE: + logger.error("Gemini package not installed. Run 'pip install google-generativeai'") + return False + + if not api_key: + logger.error("No Gemini API key provided") + return False + + try: + genai.configure(api_key=api_key) + # Test API connection + models = genai.list_models() + self.providers[ModelProvider.GEMINI] = True + self.available_models[ModelProvider.GEMINI] = [m.name for m in models if 'generateContent' in m.supported_generation_methods] + logger.info(f"Successfully connected to Gemini API. 
Available models: {self.available_models[ModelProvider.GEMINI]}") + return True + except Exception as e: + logger.error(f"Failed to setup Gemini API: {e}") + return False + + def register_anthropic(self, api_key: str) -> bool: + """Register Anthropic/Claude as a provider.""" + if not ANTHROPIC_AVAILABLE: + logger.error("Anthropic package not installed. Run 'pip install anthropic'") + return False + + if not api_key: + logger.error("No Anthropic API key provided") + return False + + try: + self.anthropic_client = anthropic.Anthropic(api_key=api_key) + # Test API connection by listing models + models = self.anthropic_client.models.list() + self.providers[ModelProvider.ANTHROPIC] = True + self.available_models[ModelProvider.ANTHROPIC] = [model.id for model in models.data] + logger.info(f"Successfully connected to Anthropic API. Available models: {self.available_models[ModelProvider.ANTHROPIC]}") + return True + except Exception as e: + logger.error(f"Failed to setup Anthropic API: {e}") + return False + + def is_provider_available(self, provider: ModelProvider) -> bool: + """Check if a provider is available.""" + return provider in self.providers and self.providers[provider] + + def get_available_providers(self) -> List[ModelProvider]: + """Get a list of available providers.""" + return [provider for provider in self.providers if self.providers[provider]] + + def get_provider_models(self, provider: ModelProvider) -> List[str]: + """Get available models for a provider.""" + if provider in self.available_models: + return self.available_models[provider] + return [] + + def analyze_papers( + self, + papers: List[Dict[str, Any]], + query: Dict[str, str], + providers: List[ModelProvider] = None, + model_names: Dict[ModelProvider, str] = None, + threshold_score: int = 7, + ) -> Tuple[List[Dict[str, Any]], bool]: + """ + Analyze papers using multiple model providers. 
+ + Args: + papers: List of paper dictionaries + query: Dictionary with 'interest' key describing research interests + providers: List of providers to use (defaults to all available) + model_names: Dictionary mapping providers to model names + threshold_score: Minimum score for a paper to be considered relevant + + Returns: + Tuple of (list of papers with analysis, hallucination flag) + """ + if not providers: + providers = self.get_available_providers() + + if not model_names: + model_names = {} + + # Default model names if not specified + default_models = { + ModelProvider.OPENAI: "gpt-3.5-turbo-16k", + ModelProvider.GEMINI: "gemini-1.5-flash", + ModelProvider.ANTHROPIC: "claude-3-5-sonnet-20240620" + } + + # Use default models if not specified + for provider in providers: + if provider not in model_names: + model_names[provider] = default_models.get(provider) + + # Check if any providers are available + if not any(self.is_provider_available(provider) for provider in providers): + logger.error("No available providers for paper analysis") + return papers, False + + analyzed_papers = [] + hallucination = False + + # Import the modules here to avoid circular imports + if ModelProvider.OPENAI in providers and self.is_provider_available(ModelProvider.OPENAI): + from relevancy import generate_relevance_score + try: + analyzed_papers, hallu = generate_relevance_score( + papers, + query=query, + model_name=model_names[ModelProvider.OPENAI], + threshold_score=threshold_score, + num_paper_in_prompt=2 + ) + hallucination = hallucination or hallu + except Exception as e: + logger.error(f"Error analyzing papers with OpenAI: {e}") + + # Add Gemini analysis if available + if ModelProvider.GEMINI in providers and self.is_provider_available(ModelProvider.GEMINI): + # Import locally to avoid circular imports + from gemini_utils import analyze_papers_with_gemini + + try: + if not analyzed_papers: # If OpenAI analysis failed or was not used + analyzed_papers = papers + + 
analyzed_papers = analyze_papers_with_gemini( + analyzed_papers, + query=query, + model_name=model_names[ModelProvider.GEMINI] + ) + except Exception as e: + logger.error(f"Error analyzing papers with Gemini: {e}") + + # Add Anthropic/Claude analysis if available + if ModelProvider.ANTHROPIC in providers and self.is_provider_available(ModelProvider.ANTHROPIC): + # Import locally to avoid circular imports + from anthropic_utils import analyze_papers_with_claude + + try: + if not analyzed_papers: # If previous analyses failed or were not used + analyzed_papers = papers + + analyzed_papers = analyze_papers_with_claude( + analyzed_papers, + query=query, + model_name=model_names[ModelProvider.ANTHROPIC] + ) + except Exception as e: + logger.error(f"Error analyzing papers with Claude: {e}") + + return analyzed_papers, hallucination + + def get_mechanistic_interpretability_analysis( + self, + paper: Dict[str, Any], + provider: ModelProvider = None, + model_name: str = None + ) -> Dict[str, Any]: + """ + Get specialized mechanistic interpretability analysis for a paper. 
+ + Args: + paper: Paper dictionary + provider: Provider to use (defaults to first available) + model_name: Model name to use + + Returns: + Dictionary with mechanistic interpretability analysis + """ + # Import interpretability analysis functions + from interpretability_analysis import ( + create_analysis_prompt, + extract_json_from_text, + analyze_interpretability_circuits, + get_paper_relation_to_ai_safety + ) + + if not provider: + available_providers = self.get_available_providers() + if not available_providers: + logger.error("No available providers for mechanistic interpretability analysis") + return {"error": "No available providers"} + provider = available_providers[0] + + if not model_name: + # Use more powerful models for specialized analysis + default_models = { + ModelProvider.OPENAI: "gpt-4o", + ModelProvider.GEMINI: "gemini-2.0-flash", + ModelProvider.ANTHROPIC: "claude-3-5-sonnet-20240620" + } + model_name = default_models.get(provider) + + if not self.is_provider_available(provider): + logger.error(f"Provider {provider} is not available") + return {"error": f"Provider {provider} is not available"} + + # Get specialized prompt + prompt = create_analysis_prompt(paper, "mechanistic_interpretability") + + # Process based on provider + if provider == ModelProvider.OPENAI: + try: + response = openai.ChatCompletion.create( + model=model_name, + messages=[ + {"role": "system", "content": "You are a specialist in mechanistic interpretability and AI safety."}, + {"role": "user", "content": prompt} + ], + temperature=0.3, + max_tokens=2048 + ) + + # Extract JSON from response + content = response.choices[0].message.content + analysis = extract_json_from_text(content) + + # Add additional circuit analysis if there's no error + if "error" not in analysis: + analysis = analyze_interpretability_circuits(paper, analysis) + analysis["ai_safety_relation"] = get_paper_relation_to_ai_safety(paper) + + return analysis + + except Exception as e: + logger.error(f"Error 
getting mechanistic interpretability analysis with OpenAI: {e}") + return {"error": f"OpenAI error: {str(e)}"} + + elif provider == ModelProvider.GEMINI and GEMINI_AVAILABLE: + try: + model = genai.GenerativeModel(model_name) + response = model.generate_content(prompt) + + # Extract JSON from response + content = response.text + analysis = extract_json_from_text(content) + + # Add additional circuit analysis if there's no error + if "error" not in analysis: + analysis = analyze_interpretability_circuits(paper, analysis) + analysis["ai_safety_relation"] = get_paper_relation_to_ai_safety(paper) + + return analysis + + except Exception as e: + logger.error(f"Error getting mechanistic interpretability analysis with Gemini: {e}") + return {"error": f"Gemini error: {str(e)}"} + + elif provider == ModelProvider.ANTHROPIC and ANTHROPIC_AVAILABLE: + try: + response = self.anthropic_client.messages.create( + model=model_name, + max_tokens=2048, + temperature=0.3, + system="You are a specialist in mechanistic interpretability and AI safety.", + messages=[ + {"role": "user", "content": prompt} + ] + ) + + # Extract JSON from response + content = response.content[0].text + analysis = extract_json_from_text(content) + + # Add additional circuit analysis if there's no error + if "error" not in analysis: + analysis = analyze_interpretability_circuits(paper, analysis) + analysis["ai_safety_relation"] = get_paper_relation_to_ai_safety(paper) + + return analysis + + except Exception as e: + logger.error(f"Error getting mechanistic interpretability analysis with Claude: {e}") + return {"error": f"Claude error: {str(e)}"} + + return {"error": "Unsupported provider or configuration"} + + def analyze_design_automation( + self, + paper: Dict[str, Any], + provider: ModelProvider = None, + model_name: str = None + ) -> Dict[str, Any]: + """ + Get specialized analysis for design automation papers. 
+ + Args: + paper: Paper dictionary + provider: Provider to use (defaults to first available) + model_name: Model name to use + + Returns: + Dictionary with design automation analysis + """ + # Import design automation functions + from design_automation import ( + create_design_analysis_prompt, + extract_design_capabilities + ) + from interpretability_analysis import extract_json_from_text + + if not provider: + available_providers = self.get_available_providers() + if not available_providers: + logger.error("No available providers for design automation analysis") + return {"error": "No available providers"} + provider = available_providers[0] + + if not model_name: + # Use appropriate models for design analysis + default_models = { + ModelProvider.OPENAI: "gpt-4o", + ModelProvider.GEMINI: "gemini-2.0-flash", + ModelProvider.ANTHROPIC: "claude-3-5-sonnet-20240620" + } + model_name = default_models.get(provider) + + if not self.is_provider_available(provider): + logger.error(f"Provider {provider} is not available") + return {"error": f"Provider {provider} is not available"} + + # Get specialized prompt + prompt = create_design_analysis_prompt(paper) + + # Process based on provider + try: + analysis = None + + if provider == ModelProvider.OPENAI: + response = openai.ChatCompletion.create( + model=model_name, + messages=[ + {"role": "system", "content": "You are a specialist in AI for design automation."}, + {"role": "user", "content": prompt} + ], + temperature=0.3, + max_tokens=2048 + ) + content = response.choices[0].message.content + analysis = extract_json_from_text(content) + + elif provider == ModelProvider.GEMINI and GEMINI_AVAILABLE: + model = genai.GenerativeModel(model_name) + response = model.generate_content(prompt) + content = response.text + analysis = extract_json_from_text(content) + + elif provider == ModelProvider.ANTHROPIC and ANTHROPIC_AVAILABLE: + response = self.anthropic_client.messages.create( + model=model_name, + max_tokens=2048, + 
temperature=0.3, + system="You are a specialist in AI for design automation.", + messages=[ + {"role": "user", "content": prompt} + ] + ) + content = response.content[0].text + analysis = extract_json_from_text(content) + + # Enhance analysis with design capabilities if successful + if analysis and "error" not in analysis: + capabilities = extract_design_capabilities(analysis) + analysis["capabilities"] = capabilities + + return analysis or {"error": "Failed to generate analysis"} + + except Exception as e: + logger.error(f"Error analyzing design automation paper: {e}") + return {"error": f"Analysis error: {str(e)}"} + +# Create a singleton instance +model_manager = ModelManager() \ No newline at end of file diff --git a/src/paths.py b/src/paths.py new file mode 100644 index 0000000..911779e --- /dev/null +++ b/src/paths.py @@ -0,0 +1,17 @@ +""" +Common path definitions for ArxivDigest-extra. +This module provides consistent paths throughout the application. +""" +import os + +# Get the project root directory +ROOT_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__))) + +# Define common directories +DATA_DIR = os.path.join(ROOT_DIR, "data") +DIGEST_DIR = os.path.join(ROOT_DIR, "digest") +SRC_DIR = os.path.join(ROOT_DIR, "src") + +# Create directories if they don't exist +for directory in [DATA_DIR, DIGEST_DIR]: + os.makedirs(directory, exist_ok=True) \ No newline at end of file diff --git a/src/relevancy.py b/src/relevancy.py index 5cef09a..fb2c69c 100644 --- a/src/relevancy.py +++ b/src/relevancy.py @@ -1,7 +1,6 @@ """ run: python -m relevancy run_all_day_paper \ - --output_dir ./data \ --model_name="gpt-3.5-turbo-16k" \ """ import time @@ -16,10 +15,26 @@ import tqdm import utils +from paths import DATA_DIR -def encode_prompt(query, prompt_papers): - """Encode multiple prompt instructions into a single string.""" - prompt = open("src/relevancy_prompt.txt").read() + "\n" + +def encode_prompt(query, prompt_papers, include_content=True): + """ + Encode 
multiple prompt instructions into a single string. + + Args: + query: Dictionary with interest field + prompt_papers: List of paper dictionaries + include_content: Whether to include the full content field (False for stage 1 filtering) + """ + # Use different prompt templates for each stage + if include_content: + # Stage 2: Full analysis with content + prompt = open("src/relevancy_prompt.txt").read() + "\n" + else: + # Stage 1: Quick relevancy scoring with just title and abstract + prompt = open("src/relevancy_filter_prompt.txt").read() + "\n" + prompt += query['interest'] for idx, task_dict in enumerate(prompt_papers): @@ -30,51 +45,246 @@ def encode_prompt(query, prompt_papers): prompt += f"{idx + 1}. Title: {title}\n" prompt += f"{idx + 1}. Authors: {authors}\n" prompt += f"{idx + 1}. Abstract: {abstract}\n" + + # Only include content in stage 2 + if include_content and "content" in task_dict: + content = task_dict["content"] + prompt += f"{idx + 1}. Content: {content}\n" + prompt += f"\n Generate response:\n1." - print(prompt) + + # Just log the number of papers and stage information + num_papers = len(prompt_papers) + stage = "Stage 2 (full analysis)" if include_content else "Stage 1 (relevancy filtering)" + print(f"Sending prompt for {stage} with {num_papers} papers") + return prompt -def post_process_chat_gpt_response(paper_data, response, threshold_score=8): +def is_json(myjson): + try: + json.loads(myjson) + except Exception as e: + return False + return True + +def extract_json_from_string(text): + """ + Improved JSON extraction that can handle multiple JSON objects in different formats + """ + # Clean up the text - remove markdown code blocks and backticks + text = text.replace("```json", "").replace("```", "").strip() + + # Try to find all JSON objects in the text + json_objects = [] + + # First, try to split by numbered lines (1., 2., etc.) 
+ numbered_pattern = re.compile(r'^\d+\.\s*(\{.*?\})', re.DOTALL | re.MULTILINE) + numbered_matches = numbered_pattern.findall(text) + + if numbered_matches: + # Found numbered JSON objects + for json_str in numbered_matches: + try: + parsed = json.loads(json_str) + json_objects.append(parsed) + except json.JSONDecodeError: + pass + + # If we didn't find numbered objects, look for direct JSON objects + if not json_objects: + # Find all potential JSON objects + stack = [] + start_indices = [] + + for i, char in enumerate(text): + if char == '{' and (not stack): + start_indices.append(i) + stack.append(char) + elif char == '{': + stack.append(char) + elif char == '}' and stack: + stack.pop() + if not stack: + # Found a complete JSON object + json_str = text[start_indices.pop():i+1] + try: + parsed = json.loads(json_str) + json_objects.append(parsed) + except json.JSONDecodeError: + pass + + print(f"Found {len(json_objects)} JSON objects in the response") + return json_objects + +def post_process_chat_gpt_response(paper_data, response, threshold_score=0): + """ + Completely rewritten parsing function that handles the OpenAI response better + """ selected_data = [] if response is None: - return [] - json_items = response['message']['content'].replace("\n\n", "\n").split("\n") - pattern = r"^\d+\. 
|\\" - import pprint - try: - score_items = [ - json.loads(re.sub(pattern, "", line)) - for line in json_items if "relevancy score" in line.lower()] - except Exception: - pprint.pprint([re.sub(pattern, "", line) for line in json_items if "relevancy score" in line.lower()]) - raise RuntimeError("failed") - pprint.pprint(score_items) - scores = [] - for item in score_items: - temp = item["Relevancy score"] - if isinstance(temp, str) and "/" in temp: - scores.append(int(temp.split("/")[0])) - else: - scores.append(int(temp)) - if len(score_items) != len(paper_data): + print("Response is None") + return [], False + + # Handle both old and new API response formats + if isinstance(response, dict) and 'message' in response: + # Old API format + content = response['message']['content'] + elif hasattr(response, 'choices') and len(response.choices) > 0: + # New API format (OpenAI Client) + content = response.choices[0].message.content + else: + # Fallback to dictionary access + try: + content = response.get('choices', [{}])[0].get('message', {}).get('content', '') + except Exception: + content = '' + + if not content: + print("Content is empty") + return [], False + + # Print the raw content for debugging + print(f"\nRaw content:\n{content}\n") + + # Try to extract multiple JSON objects from the content + json_objects = extract_json_from_string(content) + + if json_objects: + # Found JSON objects using our improved extractor + score_items = [] + for obj in json_objects: + if "Relevancy score" in obj or "relevancy score" in obj: + # Normalize key names (handle case sensitivity) + normalized_obj = {} + for key, value in obj.items(): + if key.lower() == "relevancy score": + normalized_obj["Relevancy score"] = value + else: + normalized_obj[key] = value + score_items.append(normalized_obj) + else: + # Fallback to older parsing method + score_items = [] + json_items = content.replace("\n\n", "\n").split("\n") + pattern = r"^\d+\. 
|\\" + + for line in json_items: + if is_json(re.sub(pattern, "", line)) and "relevancy score" in line.lower(): + try: + parsed_item = json.loads(re.sub(pattern, "", line)) + score_items.append(parsed_item) + except json.JSONDecodeError: + pass + + print(f"Found {len(score_items)} score items from response") + + # If we have no score items but have paper data, create default ones + if len(score_items) == 0 and len(paper_data) > 0: + print("Creating default score items for each paper") + score_items = [] + for i in range(len(paper_data)): + # Create a default item with a mid-range score + score_items.append({ + "Relevancy score": 5, + "Reasons for match": "Default score assigned due to parsing issues.", + "Key innovations": "Not available in analysis", + "Critical analysis": "Not available in analysis", + "Goal": "Not available in analysis", + "Data": "Not available in analysis", + "Methodology": "Not available in analysis", + "Implementation details": "Not available in analysis", + "Experiments & Results": "Not available in analysis", + "Git": "Not available in analysis", + "Discussion & Next steps": "Not available in analysis", + "Related work": "Not available in analysis", + "Practical applications": "Not available in analysis", + "Key takeaways": "Not available in analysis" + }) + + # Truncate score_items if needed + if len(score_items) > len(paper_data): + print(f"WARNING: More score items ({len(score_items)}) than papers ({len(paper_data)})") score_items = score_items[:len(paper_data)] hallucination = True else: hallucination = False + # Define expected analysis fields we want to ensure are copied to the paper objects + analysis_fields = [ + "Relevancy score", "Reasons for match", "Key innovations", "Critical analysis", + "Goal", "Data", "Methodology", "Implementation details", "Experiments & Results", + "Git", "Discussion & Next steps", "Related work", "Practical applications", + "Key takeaways" + ] + + print(f"DEBUG: Processing {len(score_items)} score items for {len(paper_data)} papers") + + # If 
we don't have any score items but have papers, something went wrong with parsing + if len(score_items) == 0 and len(paper_data) > 0: + print("WARNING: No score items were found, but papers exist. Check JSON parsing.") + # Create fallback score items with default score to prevent empty results + for i in range(len(paper_data)): + fallback_item = { + "Relevancy score": threshold_score, # Set to threshold score to ensure it passes filter + "Reasons for match": "Automatically assigned threshold score due to parsing issues." + } + score_items.append(fallback_item) + + # Ensure we have at least one paper if there are score items for idx, inst in enumerate(score_items): - # if the decoding stops due to length, the last example is likely truncated so we discard it - if scores[idx] < threshold_score: + if idx >= len(paper_data): + print(f"DEBUG: Index {idx} out of range for paper_data (length {len(paper_data)})") continue - output_str = "Title: " + paper_data[idx]["title"] + "\n" + + # Get the relevancy score + relevancy_score = inst.get('Relevancy score', 0) + if isinstance(relevancy_score, str): + try: + # Try to convert string score to integer + if '/' in relevancy_score: + relevancy_score = int(relevancy_score.split('/')[0]) + else: + relevancy_score = int(relevancy_score) + except (ValueError, TypeError): + relevancy_score = threshold_score # Default to threshold if conversion fails + + print(f"DEBUG: Processing paper {idx+1} with score {relevancy_score}") + + # Only process papers that meet the threshold + if relevancy_score < threshold_score: + print(f"DEBUG: Skipping paper {idx+1} with score {relevancy_score} < threshold {threshold_score}") + continue + + # Create detailed output string for logging and console display + output_str = "Subject: " + paper_data[idx]["subjects"] + "\n" + output_str += "Title: " + paper_data[idx]["title"] + "\n" output_str += "Authors: " + paper_data[idx]["authors"] + "\n" output_str += "Link: " + paper_data[idx]["main_page"] + "\n" + + # 
Copy all fields from the analysis to the paper object for key, value in inst.items(): paper_data[idx][key] = value output_str += str(key) + ": " + str(value) + "\n" + + # Ensure all expected analysis fields are present in the paper object + # This ensures fields used in the HTML template like "Key innovations" are set + for field in analysis_fields: + if field in inst: + # Double-check the field got copied (should be redundant with the loop above) + paper_data[idx][field] = inst[field] + print(f"Found and copied field: {field}") + else: + print(f"Missing analysis field: {field}") + paper_data[idx][field] = "Not available in analysis" + paper_data[idx]['summarized_text'] = output_str selected_data.append(paper_data[idx]) + print(f"DEBUG: Added paper {idx+1} to selected_data (now has {len(selected_data)} papers)") + + print(f"DEBUG: Selected papers count: {len(selected_data)}") + print(f"DEBUG: Paper fields: {list(selected_data[0].keys()) if selected_data else 'No papers'}") + return selected_data, hallucination @@ -87,30 +297,36 @@ def process_subject_fields(subjects): all_subjects = [s.split(" (")[0] for s in all_subjects] return all_subjects -def generate_relevance_score( +def filter_papers_by_relevance( all_papers, query, model_name="gpt-3.5-turbo-16k", - threshold_score=8, - num_paper_in_prompt=4, - temperature=0.4, + threshold_score=2, + num_paper_in_prompt=8, # Fixed at 8 papers per prompt as requested + temperature=0.3, # Lower temperature for more consistent relevancy scoring top_p=1.0, - sorting=True + max_papers=10 # Try to find at least this many papers that meet the threshold ): - ans_data = [] - request_idx = 1 - hallucination = False - for id in tqdm.tqdm(range(0, len(all_papers), num_paper_in_prompt)): - prompt_papers = all_papers[id:id+num_paper_in_prompt] - # only sampling from the seed tasks - prompt = encode_prompt(query, prompt_papers) - + """ + Stage 1: Filter papers by relevance using only title and abstract + Returns only papers that meet or 
exceed the threshold score + """ + filtered_papers = [] + print(f"\n===== STAGE 1: FILTERING PAPERS BY RELEVANCE (THRESHOLD >= {threshold_score}) =====") + + for id in tqdm.tqdm(range(0, len(all_papers), num_paper_in_prompt), desc="Stage 1: Relevancy filtering"): + batch_papers = all_papers[id:id+num_paper_in_prompt] + + # Create prompt without content for quick relevancy filtering + prompt = encode_prompt(query, batch_papers, include_content=False) + decoding_args = utils.OpenAIDecodingArguments( temperature=temperature, n=1, - max_tokens=128*num_paper_in_prompt, # The response for each paper should be less than 128 tokens. + max_tokens=512, # Less tokens needed for just scoring top_p=top_p, ) + request_start = time.time() response = utils.openai_completion( prompts=prompt, @@ -119,29 +335,263 @@ def generate_relevance_score( decoding_args=decoding_args, logit_bias={"100257": -100}, # prevent the <|endoftext|> from being generated ) - print ("response", response['message']['content']) + request_duration = time.time() - request_start - + print(f"Stage 1 batch took {request_duration:.2f}s") + + # Extract just the relevancy scores process_start = time.time() - batch_data, hallu = post_process_chat_gpt_response(prompt_papers, response, threshold_score=threshold_score) - hallucination = hallucination or hallu - ans_data.extend(batch_data) + batch_data, _ = post_process_chat_gpt_response( + batch_papers, + response, + threshold_score=0 # Don't filter yet, we want all scores + ) + + # Keep only papers that meet or exceed the threshold + # Make sure we have the same number of scores as papers + if len(batch_data) != len(batch_papers): + print(f"WARNING: Mismatch between batch_data ({len(batch_data)}) and batch_papers ({len(batch_papers)})") + # If we have different counts, we need to match papers to scores + # This handles cases where not all papers got scores + + # Create a map of titles to papers for easier lookup + title_to_paper = {p["title"]: p for p in batch_papers} 
+ + # Match scores to papers + for paper in batch_data: + if "title" in paper and paper["title"] in title_to_paper: + # Found a match by title + relevancy_score = paper.get("Relevancy score", 0) + if isinstance(relevancy_score, str): + try: + if '/' in relevancy_score: + relevancy_score = int(relevancy_score.split('/')[0]) + else: + relevancy_score = int(relevancy_score) + except (ValueError, TypeError): + relevancy_score = 0 + + if relevancy_score >= threshold_score: + print(f"PASSED: Paper '{paper['title'][:50]}...' with score {relevancy_score}") + filtered_papers.append(paper) + else: + print(f"FILTERED OUT: Paper '{paper['title'][:50]}...' with score {relevancy_score}") + else: + # We have the expected number of scores + for paper in batch_data: + relevancy_score = paper.get("Relevancy score", 0) + if isinstance(relevancy_score, str): + try: + if '/' in relevancy_score: + relevancy_score = int(relevancy_score.split('/')[0]) + else: + relevancy_score = int(relevancy_score) + except (ValueError, TypeError): + relevancy_score = 0 + + if relevancy_score >= threshold_score: + print(f"PASSED: Paper '{paper['title'][:50]}...' with score {relevancy_score}") + filtered_papers.append(paper) + else: + print(f"FILTERED OUT: Paper '{paper['title'][:50]}...' 
with score {relevancy_score}") + + print(f"Post-processing took {time.time() - process_start:.2f}s") + print(f"Filtered papers so far: {len(filtered_papers)} out of {id + len(batch_papers)}") + + print(f"\nStage 1 complete: {len(filtered_papers)} papers met the threshold of {threshold_score} out of {len(all_papers)}") + + # If we didn't find enough papers, adjust threshold downward and include more + if len(filtered_papers) < max_papers and threshold_score > 1: + # Find the highest-scored papers that didn't meet the threshold + remaining_scores = [] # (score, paper) pairs; paper dicts are unhashable, so they cannot be dict keys + for paper in all_papers: + if paper not in filtered_papers: + score = paper.get("Relevancy score", 0) + if isinstance(score, str): + try: + score = int(score) + except (ValueError, TypeError): + score = 0 + remaining_scores.append((score, paper)) + + # Sort the remaining papers by score (descending) + sorted_papers = sorted(remaining_scores, key=lambda item: item[0], reverse=True) + + # Add the highest-scored papers until we reach max_papers or run out of papers + papers_to_add = sorted_papers[:max_papers - len(filtered_papers)] + for item in papers_to_add: + score, paper = item + print(f"Adding paper '{paper['title'][:50]}...' 
with score {score} (below threshold) to meet minimum paper count") + filtered_papers.append(paper) + + print(f"Added {len(papers_to_add)} papers below threshold to reach {len(filtered_papers)} total papers") + + return filtered_papers + - print(f"Request {request_idx+1} took {request_duration:.2f}s") +def analyze_papers_in_depth( + filtered_papers, + query, + model_name="gemini-1.5-flash", # Use Gemini by default for detailed analysis + num_paper_in_prompt=5, # Smaller batches for detailed analysis + temperature=0.5, + top_p=1.0 +): + """ + Stage 2: Analyze papers in depth, including content analysis + Only called for papers that passed the relevancy threshold + """ + analyzed_papers = [] + print(f"\n===== STAGE 2: DETAILED ANALYSIS OF {len(filtered_papers)} PAPERS =====") + + # If we're using Gemini, use their API instead + if "gemini" in model_name: + print(f"Using Gemini for detailed analysis: {model_name}") + from gemini_utils import analyze_papers_with_gemini + return analyze_papers_with_gemini( + filtered_papers, + query=query, + model_name=model_name + ) + + # Otherwise use OpenAI + for id in tqdm.tqdm(range(0, len(filtered_papers), num_paper_in_prompt), desc="Stage 2: Detailed analysis"): + batch_papers = filtered_papers[id:id+num_paper_in_prompt] + + # Create prompt with content for detailed analysis + prompt = encode_prompt(query, batch_papers, include_content=True) + + decoding_args = utils.OpenAIDecodingArguments( + temperature=temperature, + n=1, + max_tokens=1024*num_paper_in_prompt, + top_p=top_p, + ) + + request_start = time.time() + response = utils.openai_completion( + prompts=prompt, + model_name=model_name, + batch_size=1, + decoding_args=decoding_args, + logit_bias={"100257": -100}, # prevent the <|endoftext|> from being generated + ) + + request_duration = time.time() - request_start + print(f"Stage 2 batch took {request_duration:.2f}s") + + # Process the detailed analysis + process_start = time.time() + batch_data, _ = 
post_process_chat_gpt_response(batch_papers, response, threshold_score=0) + analyzed_papers.extend(batch_data) + print(f"Post-processing took {time.time() - process_start:.2f}s") + print(f"Analyzed papers so far: {len(analyzed_papers)} out of {len(filtered_papers)}") + + print(f"\nStage 2 complete: {len(analyzed_papers)} papers fully analyzed") + return analyzed_papers + - if sorting: - ans_data = sorted(ans_data, key=lambda x: int(x["Relevancy score"]), reverse=True) +def generate_relevance_score( + all_papers, + query, + model_name="gpt-3.5-turbo-16k", + threshold_score=2, + num_paper_in_prompt=8, # Fixed at 8 papers per prompt + temperature=0.4, + top_p=1.0, + sorting=True, + stage2_model="gemini-1.5-flash", # Model to use for Stage 2 + min_papers=10 # Minimum number of papers to return +): + """ + Two-stage paper processing: + 1. Filter papers by relevance using OpenAI (fast, based on title/abstract) + 2. Analyze relevant papers in depth using Gemini (detailed, includes content) + """ + # Stage 1: Filter by relevance (OpenAI) + filtered_papers = filter_papers_by_relevance( + all_papers, + query, + model_name=model_name, + threshold_score=threshold_score, + num_paper_in_prompt=num_paper_in_prompt, + temperature=temperature, + top_p=top_p, + max_papers=min_papers # Ensure we get at least this many papers + ) + + # If no papers passed the threshold, return empty results + if len(filtered_papers) == 0: + print("No papers passed the relevance threshold. 
Returning empty results.") + return [], False + + # Before Stage 2: Extract HTML content for papers that passed the filter + print(f"\n===== EXTRACTING HTML CONTENT FOR {len(filtered_papers)} PAPERS =====") + for i, paper in enumerate(filtered_papers): + try: + # Extract HTML content from the paper URL + from download_new_papers import crawl_html_version + + # Get the paper ID from the main_page URL + paper_id = None + main_page = paper.get("main_page", "") + if main_page: + # Extract paper ID (e.g., 2401.12345) + import re + id_match = re.search(r'/abs/([0-9v.]+)', main_page) + if id_match: + paper_id = id_match.group(1) + + if paper_id: + # Construct HTML link + html_link = f"https://arxiv.org/html/{paper_id}" + print(f"Fetching HTML content for paper {i+1}/{len(filtered_papers)}: {paper['title'][:50]}...") + print(f"HTML link: {html_link}") + + # Try to get content + content = crawl_html_version(html_link) + if content and len(content) > 100 and "Error accessing HTML" not in content: + paper["content"] = content + print(f"✅ Successfully extracted {len(content)} characters of content") + else: + # If HTML version fails, use the abstract + more details + paper["content"] = f"{paper.get('abstract', '')} {paper.get('title', '')}" + print(f"⚠️ Failed to extract content, using abstract instead. 
Error: {content[:100]}...") + time.sleep(3) + else: + print(f"⚠️ Couldn't parse paper ID from URL: {main_page}") + paper["content"] = paper.get("abstract", "No content available") + + except Exception as e: + print(f"❌ Error extracting HTML content: {str(e)}") + # Fallback to using the abstract + paper["content"] = paper.get("abstract", "No content available") + + print(f"Content extraction complete for {len(filtered_papers)} papers.") + + # Stage 2: In-depth analysis (Gemini or fallback to OpenAI) + analyzed_papers = analyze_papers_in_depth( + filtered_papers, + query, + model_name=stage2_model, + num_paper_in_prompt=max(1, num_paper_in_prompt // 2), # Smaller batches for detailed analysis + temperature=temperature, + top_p=top_p + ) + + # Sort by relevancy score if requested + if sorting and analyzed_papers: + analyzed_papers = sorted(analyzed_papers, key=lambda x: int(x.get("Relevancy score", 0)), reverse=True) - return ans_data, hallucination + return analyzed_papers, False # No hallucination tracking in two-stage system def run_all_day_paper( - query={"interest":"", "subjects":["Computation and Language", "Artificial Intelligence"]}, + query={"interest":"Computer Science", "subjects":["Machine Learning", "Computation and Language", "Artificial Intelligence", "Information Retrieval"]}, date=None, - data_dir="../data", model_name="gpt-3.5-turbo-16k", - threshold_score=8, - num_paper_in_prompt=8, + threshold_score=7, + num_paper_in_prompt=2, temperature=0.4, top_p=1.0 ): @@ -150,7 +600,8 @@ def run_all_day_paper( # string format such as Wed, 10 May 23 print ("the date for the arxiv data is: ", date) - all_papers = [json.loads(l) for l in open(f"{data_dir}/{date}.jsonl", "r")] + file_path = os.path.join(DATA_DIR, f"{date}.jsonl") + all_papers = [json.loads(l) for l in open(file_path, "r")] print (f"We found {len(all_papers)}.") all_papers_in_subjects = [ @@ -159,7 +610,8 @@ def run_all_day_paper( ] print(f"After filtering subjects, we have 
{len(all_papers_in_subjects)} papers left.") ans_data = generate_relevance_score(all_papers_in_subjects, query, model_name, threshold_score, num_paper_in_prompt, temperature, top_p) - utils.write_ans_to_file(ans_data, date, output_dir="../outputs") + from paths import DIGEST_DIR + utils.write_ans_to_file(ans_data, date, output_dir=DIGEST_DIR) return ans_data diff --git a/src/relevancy_filter_prompt.txt b/src/relevancy_filter_prompt.txt new file mode 100644 index 0000000..b047a70 --- /dev/null +++ b/src/relevancy_filter_prompt.txt @@ -0,0 +1,23 @@ +You are a research assistant with expertise in analyzing academic papers, particularly in AI and machine learning. You've been asked to perform PRELIMINARY SCREENING of arXiv papers based ONLY on their titles and abstracts. + +Your task is to evaluate which papers are worth analyzing in depth based on their potential relevance to the researcher's specific interests. + +For each paper, provide ONLY a relevancy score out of 10, with a higher score indicating greater relevance to the researcher's specific interests. Each paper's score should be accompanied by a brief explanation of why it matches or doesn't match the research interests. + +Papers scoring 7 or higher will undergo detailed analysis with their full content, so be selective. + +VERY IMPORTANT: Respond with a numbered list of valid JSON objects. The format MUST be exactly like this for each paper: + +1. { + "Relevancy score": 7, + "Reasons for match": "Paper discusses multi-agent systems with focus on coordination mechanisms, which directly aligns with research interests." +} + +2. { + "Relevancy score": 3, + "Reasons for match": "Mentions agents but focuses on image processing applications, which is not part of the stated research interests." +} + +DO NOT use "```json" code blocks or any other formatting. Just provide numbered JSON objects exactly as shown above. 
+ +My research interests are: \ No newline at end of file diff --git a/src/relevancy_prompt.txt b/src/relevancy_prompt.txt index fb413c4..d94a076 100644 --- a/src/relevancy_prompt.txt +++ b/src/relevancy_prompt.txt @@ -1,7 +1,26 @@ -You have been asked to read a list of a few arxiv papers, each with title, authors and abstract. -Based on my specific research interests, elevancy score out of 10 for each paper, based on my specific research interest, with a higher score indicating greater relevance. A relevance score more than 7 will need person's attention for details. -Additionally, please generate 1-2 sentence summary for each paper explaining why it's relevant to my research interests. -Please keep the paper order the same as in the input list, with one json format per line. Example is: -1. {"Relevancy score": "an integer score out of 10", "Reasons for match": "1-2 sentence short reasonings"} +You are a research assistant with expertise in analyzing academic papers, particularly in AI and machine learning. You've been asked to thoroughly analyze a list of arXiv papers, each with title, authors, abstract, and content. -My research interests are: \ No newline at end of file +For each paper, provide: +1. A relevancy score out of 10 based on my specific research interests, with a higher score indicating greater relevance. A score of 7 or higher means this paper deserves special attention. +2. A comprehensive analysis that would help me understand the paper's value and contributions without having to read the entire paper. + +Please maintain the original paper order in your response, with one JSON object per line. Format: + +1. 
{ + "Relevancy score": "an integer score out of 10", + "Reasons for match": "A detailed paragraph explaining why this paper aligns with my research interests, highlighting specific concepts, methodologies, or findings that match my interests", + "Key innovations": "2-3 bullet points describing the main contributions and what makes this paper novel", + "Critical analysis": "A thoughtful paragraph evaluating the strengths and potential limitations of the approach", + "Goal": "What specific problem or research gap does this paper address?", + "Data": "Detailed description of datasets used, including size, characteristics, and any novel data processing techniques", + "Methodology": "Comprehensive explanation of the methods, algorithms, and technical approach", + "Implementation details": "Information about model architecture, hyperparameters, training procedures, and computational requirements", + "Git": "Link to code repository if available, or note if code is not yet released", + "Experiments & Results": "Analysis of experimental setup, key results, and how they compare to prior work or baselines", + "Discussion & Next steps": "The authors' own conclusions, limitations they identified, and future research directions", + "Related work": "How this paper relates to similar recent papers in the field", + "Practical applications": "How the findings could be applied in real-world scenarios", + "Key takeaways": "3-5 bullet points summarizing the most important insights from this paper" +} + +My research interests are: AI Alignment, AI safety, Mechanistic Interpretability, Explainable AI, RAGs, Information Retrieval, Large Language Models, Multimodal Learning, Generative AI, Optimization in LLM, Model Efficiency, Fine-tuning Techniques, Prompt Engineering, and AI Evaluation Metrics. 
\ No newline at end of file diff --git a/src/utils.py b/src/utils.py index c128702..70bc2bc 100644 --- a/src/utils.py +++ b/src/utils.py @@ -6,14 +6,20 @@ import sys import time import json -from typing import Optional, Sequence, Union +from typing import Optional, Sequence, Union, Dict, Any import openai import tqdm -from openai import openai_object import copy -StrOrOpenAIObject = Union[str, openai_object.OpenAIObject] +# Handle both old and new OpenAI SDK versions +try: + from openai import openai_object + StrOrOpenAIObject = Union[str, openai_object.OpenAIObject] + OPENAI_OLD_API = True +except ImportError: + StrOrOpenAIObject = Union[str, Dict[str, Any]] + OPENAI_OLD_API = False openai_org = os.getenv("OPENAI_ORG") @@ -24,7 +30,8 @@ @dataclasses.dataclass class OpenAIDecodingArguments(object): - max_tokens: int = 1800 + #max_tokens: int = 1800 + max_tokens: int = 5400 temperature: float = 0.2 top_p: float = 1.0 n: int = 1 @@ -39,7 +46,7 @@ def openai_completion( prompts, #: Union[str, Sequence[str], Sequence[dict[str, str]], dict[str, str]], decoding_args: OpenAIDecodingArguments, model_name="text-davinci-003", - sleep_time=2, + sleep_time=15, batch_size=1, max_instances=sys.maxsize, max_batches=sys.maxsize, @@ -96,34 +103,97 @@ def openai_completion( ): batch_decoding_args = copy.deepcopy(decoding_args) # cloning the decoding_args - backoff = 3 + backoff = 5 while True: try: + time.sleep(3) shared_kwargs = dict( model=model_name, **batch_decoding_args.__dict__, **decoding_kwargs, ) - if is_chat_model: - completion_batch = openai.ChatCompletion.create( - messages=[ - {"role": "system", "content": "You are a helpful assistant."}, - {"role": "user", "content": prompt_batch[0]} - ], - **shared_kwargs - ) + + if OPENAI_OLD_API: + # Use old API format + if is_chat_model: + completion_batch = openai.ChatCompletion.create( + messages=[ + {"role": "system", "content": "You are a helpful assistant."}, + {"role": "user", "content": prompt_batch[0]} + ], + 
**shared_kwargs + ) + else: + completion_batch = openai.Completion.create(prompt=prompt_batch, **shared_kwargs) + + choices = completion_batch.choices + + for choice in choices: + choice["total_tokens"] = completion_batch.usage.total_tokens else: - completion_batch = openai.Completion.create(prompt=prompt_batch, **shared_kwargs) - - choices = completion_batch.choices - - for choice in choices: - choice["total_tokens"] = completion_batch.usage.total_tokens + # Use new API format + client = openai.OpenAI() + + if is_chat_model: + completion_batch = client.chat.completions.create( + model=model_name, + messages=[ + {"role": "system", "content": "You are a helpful assistant."}, + {"role": "user", "content": prompt_batch[0]} + ], + temperature=batch_decoding_args.temperature, + max_tokens=batch_decoding_args.max_tokens, + top_p=batch_decoding_args.top_p, + n=batch_decoding_args.n, + stream=batch_decoding_args.stream, + presence_penalty=batch_decoding_args.presence_penalty, + frequency_penalty=batch_decoding_args.frequency_penalty, + **decoding_kwargs + ) + + # Convert completion to dictionary format for consistency + choices = [] + for choice in completion_batch.choices: + choice_dict = { + "message": { + "content": choice.message.content, + "role": choice.message.role + }, + "index": choice.index, + "finish_reason": choice.finish_reason, + "total_tokens": completion_batch.usage.total_tokens + } + choices.append(choice_dict) + else: + completion_batch = client.completions.create( + model=model_name, + prompt=prompt_batch, + temperature=batch_decoding_args.temperature, + max_tokens=batch_decoding_args.max_tokens, + top_p=batch_decoding_args.top_p, + n=batch_decoding_args.n, + stream=batch_decoding_args.stream, + presence_penalty=batch_decoding_args.presence_penalty, + frequency_penalty=batch_decoding_args.frequency_penalty, + **decoding_kwargs + ) + + # Convert completion to dictionary format for consistency + choices = [] + for choice in completion_batch.choices: + 
choice_dict = { + "text": choice.text, + "index": choice.index, + "finish_reason": choice.finish_reason, + "total_tokens": completion_batch.usage.total_tokens + } + choices.append(choice_dict) + completions.extend(choices) break - except openai.error.OpenAIError as e: - logging.warning(f"OpenAIError: {e}.") + except Exception as e: + logging.warning(f"OpenAI API Error: {e}.") if "Please reduce your prompt" in str(e): batch_decoding_args.max_tokens = int(batch_decoding_args.max_tokens * 0.8) logging.warning(f"Reducing target length to {batch_decoding_args.max_tokens}, Retrying...") @@ -134,9 +204,14 @@ def openai_completion( backoff -= 1 logging.warning("Hit request rate limit; retrying...") time.sleep(sleep_time) # Annoying rate limit on requests. + continue if return_text: - completions = [completion.text for completion in completions] + if is_chat_model: + completions = [completion.get("message", {}).get("content", "") for completion in completions] + else: + completions = [completion.get("text", "") for completion in completions] + if decoding_args.n > 1: # make completions a nested list, where each entry is a consecutive decoding_args.n of original entries. completions = [completions[i : i + decoding_args.n] for i in range(0, len(completions), decoding_args.n)]