mokwathedeveloper/Frameworks_Assignment

CORD-19 Research Paper Analysis

Project Purpose and Learning Objectives

This project demonstrates a complete data science workflow: loading and cleaning data, analyzing and visualizing it, and building an interactive web application. The primary learning objectives are:

  • Proficiency in using Python libraries like Pandas, Matplotlib, Seaborn, and Streamlit.
  • Understanding and implementing data cleaning and preparation techniques.
  • Performing exploratory data analysis and generating meaningful visualizations.
  • Building an interactive web application to showcase data insights.
  • Adhering to professional Git practices for version control.

Dataset Source

The dataset used in this project is metadata.csv from the CORD-19 (COVID-19 Open Research Dataset) Research Challenge available on Kaggle.

Steps Implemented

1. Data Loading & Basic Exploration

  • Downloaded and loaded metadata.csv into a Pandas DataFrame.
  • Displayed initial rows, DataFrame shape, and column data types.
  • Checked for missing values across all columns.
  • Generated basic descriptive statistics for numerical columns.
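The exploration steps above can be sketched roughly as follows. This is a minimal illustration, not the contents of analysis.py: the inline `SAMPLE_CSV` string is a made-up stand-in for metadata.csv so the snippet runs on its own, and its rows are invented.

```python
import io

import pandas as pd

# Hypothetical stand-in for metadata.csv (invented rows, few columns).
# In the project you would instead load the real file, for example:
#   df = pd.read_csv("metadata.csv", low_memory=False)
SAMPLE_CSV = """title,abstract,publish_time,journal,source_x
Paper A,Some abstract text,2020-03-01,Journal X,PMC
Paper B,,2019,Journal Y,Elsevier
,Another abstract,2020-05-12,,PMC
"""

df = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(df.head())    # initial rows
print(df.shape)     # (number of rows, number of columns)
print(df.dtypes)    # data type of each column

# Count missing values in every column.
missing = df.isnull().sum()
print(missing)

# Descriptive statistics; on the real file this summarizes the numeric columns.
print(df.describe())
```

On the full dataset, `missing` is a quick way to decide which columns are complete enough to analyze.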

2. Data Cleaning & Preparation

  • Handled missing values by dropping rows with missing abstract, publish_time, title, or journal.
  • Converted the publish_time column to datetime objects and extracted the year of publication.
  • Created a new derived feature: abstract_word_count.
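A sketch of these cleaning steps, under some assumptions: the sample rows are invented, `errors="coerce"` (which turns unparseable dates into missing values) is a choice made for this example rather than something the project states, and the word count is taken as the number of whitespace-separated tokens.

```python
import pandas as pd

# Invented sample rows; the real data comes from metadata.csv.
df = pd.DataFrame(
    {
        "title": ["Paper A", "Paper B", None, "Paper D", "Paper E"],
        "abstract": [
            "Short abstract here",
            None,
            "Orphan abstract",
            "One two three four five",
            "Bad date row",
        ],
        "publish_time": ["2020-03-01", "2019-07-15", "2020-05-12", "2021-01-20", "not a date"],
        "journal": ["Journal X", "Journal Y", "Journal X", "Journal Z", "Journal Z"],
    }
)

# Drop rows missing any of the four required fields.
required = ["abstract", "publish_time", "title", "journal"]
clean = df.dropna(subset=required).copy()

# Convert to datetime. Assumption: invalid dates become NaT and are dropped.
clean["publish_time"] = pd.to_datetime(clean["publish_time"], errors="coerce")
clean = clean.dropna(subset=["publish_time"])
clean["year"] = clean["publish_time"].dt.year.astype(int)

# Derived feature: number of whitespace-separated words in the abstract.
clean["abstract_word_count"] = clean["abstract"].str.split().str.len()

print(clean[["title", "year", "abstract_word_count"]])
```

The `.copy()` after `dropna` keeps the later column assignments from touching the original DataFrame.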

3. Data Analysis & Visualization

  • Counted the number of papers published each year.
  • Identified the top 10 journals by publication count.
  • Performed a simple word frequency analysis on paper titles to generate a word cloud.
  • Created the following visualizations:
    • Line plot showing the number of publications over time.
    • Bar chart displaying the top 10 publishing journals.
    • Word cloud visualizing common terms in paper titles.
    • Bar chart showing the number of papers per source (the journal column is used as a fallback if source_x is not available).
  • All visualizations were generated using Matplotlib and Seaborn, saved as PNG files in a plots/ directory.
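The counts behind these charts can be computed along the following lines. This is a sketch on invented rows; the `STOP_WORDS` set and the `title_word_counts` helper are hypothetical names introduced for illustration, and the project's actual tokenization may differ.

```python
import re
from collections import Counter

import pandas as pd

# Invented sample of already-cleaned rows.
clean = pd.DataFrame(
    {
        "title": [
            "Coronavirus infection in children",
            "A review of coronavirus vaccines",
            "Modelling the spread of infection",
            "Vaccines and public health",
        ],
        "journal": ["Journal X", "Journal Y", "Journal X", "Journal X"],
        "year": [2019, 2020, 2020, 2020],
    }
)

# Papers per year, ordered chronologically for a line plot.
papers_per_year = clean["year"].value_counts().sort_index()

# Top 10 journals by publication count (value_counts sorts descending).
top_journals = clean["journal"].value_counts().head(10)

# Hypothetical, deliberately small stop-word list.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "to", "with"}


def title_word_counts(titles):
    """Count lowercase alphabetic words across titles, skipping stop words."""
    counts = Counter()
    for title in titles:
        words = re.findall(r"[a-z]+", str(title).lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts


word_counts = title_word_counts(clean["title"])

print(papers_per_year)
print(top_journals)
print(word_counts.most_common(5))
```

`papers_per_year` and `top_journals` are pandas Series, so they can be passed directly to Matplotlib or Seaborn plotting calls, and a frequency mapping like `word_counts` is the kind of input a word cloud generator typically accepts.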

4. Streamlit Application

  • Developed an interactive web application (app.py) using Streamlit.
  • The app features:
    • A clear title, description, and explanation of its purpose.
    • Interactive widgets: a slider for filtering by publication year range and a dropdown for selecting specific journals.
    • Dynamic display of the generated visualizations based on user selections.
    • A sample table showing the head of the filtered dataset.
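One possible shape for such an app is sketched below. It is not the repository's app.py: `filter_papers` and `render_app` are hypothetical names, the "All journals" option is an assumption, and `st.line_chart` stands in for whatever plotting the real app uses. The filtering logic is kept in a plain function so it can be checked without starting Streamlit.

```python
import pandas as pd


def filter_papers(df, year_range, journal=None):
    """Return rows whose year is within year_range (inclusive) and,
    if a journal is given, whose journal matches it."""
    low, high = year_range
    mask = df["year"].between(low, high)
    if journal is not None:
        mask &= df["journal"] == journal
    return df[mask]


def render_app(df):
    """Draw the widgets and the filtered outputs for a cleaned DataFrame."""
    # Imported here so filter_papers can be reused without Streamlit installed.
    import streamlit as st

    st.title("CORD-19 Research Paper Analysis")
    st.write("Explore publication trends in the CORD-19 metadata.")

    # Slider for the publication year range (assumes at least two distinct years).
    min_year, max_year = int(df["year"].min()), int(df["year"].max())
    year_range = st.slider(
        "Publication year range",
        min_value=min_year,
        max_value=max_year,
        value=(min_year, max_year),
    )

    # Dropdown for the journal; "All journals" disables the journal filter.
    options = ["All journals"] + sorted(df["journal"].unique())
    choice = st.selectbox("Journal", options)

    journal = None if choice == "All journals" else choice
    filtered = filter_papers(df, year_range, journal)

    # Publications per year for the current selection, then a sample table.
    st.line_chart(filtered["year"].value_counts().sort_index())
    st.dataframe(filtered.head())
```

In an app script, `render_app` would be called at module level with the cleaned DataFrame, since Streamlit reruns the whole script each time a widget value changes.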

Instructions for Running the Project

Prerequisites

  • Python 3.8+
  • Git
  • metadata.csv file downloaded from the CORD-19 Kaggle dataset and placed in the project root directory.

Setup and Installation

  1. Clone the repository:
    git clone <YOUR_GITHUB_REPO_URL>
    cd Frameworks_Assignment
  2. Create and activate a virtual environment:
    python3 -m venv venv
    # On Linux/macOS:
    source venv/bin/activate
    # On Windows (Command Prompt):
    # venv\Scripts\activate.bat
    # On Windows (PowerShell):
    # venv\Scripts\Activate.ps1
  3. Install dependencies:
    pip install -r requirements.txt
  4. Place metadata.csv: Ensure the metadata.csv file (downloaded from Kaggle) is placed directly in the Frameworks_Assignment directory.

Running the Analysis Script

To run the analysis.py script and generate the plots (saved in the plots/ directory):

python3 analysis.py

Running the Streamlit Application

To start the interactive Streamlit web application:

streamlit run app.py

Open your web browser and navigate to the URL provided by Streamlit (usually http://localhost:8501).

Key Findings and Reflections

  • The CORD-19 dataset is large and has missing values in key columns, so it needs careful cleaning before any analysis.
  • Publication counts rise sharply in the most recent years covered by the dataset, especially around the COVID-19 pandemic period.
  • Certain journals publish a consistently high volume of papers, which suggests their prominence in the field.
  • Word clouds provide a quick visual summary of prevalent topics in research paper titles.
  • Streamlit offers a powerful and straightforward way to transform static analyses into interactive web applications, making insights more accessible to a broader audience.
  • Isolating the project in a venv virtual environment keeps its dependencies separate from other projects and avoids version conflicts.
