This project aims to demonstrate a complete data science workflow, from data loading and cleaning to analysis, visualization, and building an interactive web application. The primary learning objectives include:
- Proficiency in using Python libraries like Pandas, Matplotlib, Seaborn, and Streamlit.
- Understanding and implementing data cleaning and preparation techniques.
- Performing exploratory data analysis and generating meaningful visualizations.
- Building an interactive web application to showcase data insights.
- Adhering to professional Git practices for version control.
The dataset used in this project is `metadata.csv` from the CORD-19 (COVID-19 Open Research Dataset) Research Challenge, available on Kaggle.
- Downloaded and loaded `metadata.csv` into a Pandas DataFrame.
- Displayed the initial rows, DataFrame shape, and column data types.
- Checked for missing values across all columns.
- Generated basic descriptive statistics for numerical columns.
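A minimal sketch of this loading and inspection step, assuming `metadata.csv` sits in the project root (the exact calls in `analysis.py` may differ):

```python
import pandas as pd

# Load the CORD-19 metadata into a DataFrame
df = pd.read_csv("metadata.csv", low_memory=False)

# Initial inspection: first rows, shape, and column dtypes
print(df.head())
print(df.shape)
print(df.dtypes)

# Missing values per column
print(df.isnull().sum())

# Descriptive statistics for numerical columns
print(df.describe())
```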
- Handled missing values by dropping rows with a missing `abstract`, `publish_time`, `title`, or `journal`.
- Converted the `publish_time` column to datetime objects and extracted the `year` of publication.
- Created a new derived feature: `abstract_word_count`.
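Continuing from the loaded `df`, the cleaning and feature-derivation steps might look like this (`errors="coerce"` for malformed dates is an assumption):

```python
# Drop rows missing any of the key fields
df = df.dropna(subset=["abstract", "publish_time", "title", "journal"])

# Parse publication dates and extract the year
df["publish_time"] = pd.to_datetime(df["publish_time"], errors="coerce")
df = df.dropna(subset=["publish_time"])
df["year"] = df["publish_time"].dt.year.astype(int)

# Derived feature: word count of each abstract
df["abstract_word_count"] = df["abstract"].str.split().str.len()
```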
- Counted the number of papers published each year.
- Identified the top 10 journals by publication count.
- Performed a simple word frequency analysis on paper titles to generate a word cloud.
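A sketch of these analysis steps, continuing from the cleaned `df` (the tokenization regex and top-N cutoffs are illustrative):

```python
import re
from collections import Counter

# Number of papers published each year
papers_per_year = df["year"].value_counts().sort_index()

# Top 10 journals by publication count
top_journals = df["journal"].value_counts().head(10)

# Word frequencies across paper titles (lowercased, letters only)
words = re.findall(r"[a-z]+", " ".join(df["title"]).lower())
word_freq = Counter(words)
print(word_freq.most_common(20))
```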
- Created the following visualizations:
- Line plot showing the number of publications over time.
- Bar chart displaying the top 10 publishing journals.
- Word cloud visualizing common terms in paper titles.
- Bar chart showing the distribution of paper counts by source (`journal` is used as a fallback if `source_x` is not available).
- All visualizations were generated using Matplotlib and Seaborn and saved as PNG files in a `plots/` directory.
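A condensed sketch of the plotting code, continuing from the aggregates above (the PNG file names under `plots/` are assumptions; the word cloud uses the `wordcloud` package):

```python
import os
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

os.makedirs("plots", exist_ok=True)

# Line plot: publications over time
plt.figure(figsize=(10, 5))
sns.lineplot(x=papers_per_year.index, y=papers_per_year.values)
plt.xlabel("Year")
plt.ylabel("Number of publications")
plt.savefig("plots/publications_over_time.png")
plt.close()

# Bar chart: top 10 publishing journals
plt.figure(figsize=(10, 5))
sns.barplot(x=top_journals.values, y=top_journals.index)
plt.xlabel("Publication count")
plt.savefig("plots/top_journals.png")
plt.close()

# Word cloud from the title word frequencies
wc = WordCloud(width=800, height=400, background_color="white")
wc.generate_from_frequencies(word_freq)
wc.to_file("plots/title_wordcloud.png")
```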
- Developed an interactive web application (`app.py`) using Streamlit.
- The app features:
- A clear title, description, and explanation of its purpose.
- Interactive widgets: a slider for filtering by publication year range and a dropdown for selecting specific journals.
- Dynamic display of the generated visualizations based on user selections.
- A sample table showing the head of the filtered dataset.
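A condensed sketch of how `app.py` might wire these pieces together (the widget labels, caching decorator, and journal shortlist are illustrative assumptions, not the exact contents of the app):

```python
import pandas as pd
import streamlit as st

st.title("CORD-19 Data Explorer")
st.write("Explore publication trends in the CORD-19 research metadata.")

@st.cache_data  # cache the cleaned DataFrame across reruns
def load_data():
    df = pd.read_csv("metadata.csv", low_memory=False)
    df = df.dropna(subset=["title", "journal", "publish_time"])
    df["year"] = pd.to_datetime(df["publish_time"], errors="coerce").dt.year
    return df.dropna(subset=["year"])

df = load_data()

# Slider for filtering by publication year range
year_min, year_max = int(df["year"].min()), int(df["year"].max())
years = st.slider("Publication year range", year_min, year_max, (year_min, year_max))

# Dropdown for selecting a specific journal
journals = ["All"] + df["journal"].value_counts().head(20).index.tolist()
journal = st.selectbox("Journal", journals)

# Apply the filters and show a sample of the result
filtered = df[df["year"].between(*years)]
if journal != "All":
    filtered = filtered[filtered["journal"] == journal]
st.write(f"{len(filtered)} papers match the current filters.")
st.dataframe(filtered.head())
```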
- Python 3.8+
- Git
- The `metadata.csv` file, downloaded from the CORD-19 Kaggle dataset and placed in the project root directory.
- Clone the repository:
  ```bash
  git clone <YOUR_GITHUB_REPO_URL>
  cd Frameworks_Assignment
  ```
- Create and activate a virtual environment:
  ```bash
  python3 -m venv venv

  # On Linux/macOS:
  source venv/bin/activate

  # On Windows (Command Prompt):
  # venv\Scripts\activate.bat

  # On Windows (PowerShell):
  # venv\Scripts\Activate.ps1
  ```
- Install dependencies:
  ```bash
  pip install -r requirements.txt
  ```
- Place `metadata.csv`: Ensure the `metadata.csv` file (downloaded from Kaggle) is placed directly in the `Frameworks_Assignment` directory.
To run the `analysis.py` script and generate the plots (saved in the `plots/` directory):

```bash
python3 analysis.py
```

To start the interactive Streamlit web application:

```bash
streamlit run app.py
```

Open your web browser and navigate to the URL provided by Streamlit (usually http://localhost:8501).
- The CORD-19 dataset is extensive, requiring robust data cleaning for effective analysis.
- Publication trends show a significant increase in research output over recent years, especially around the COVID-19 pandemic period.
- Certain journals consistently publish a high volume of research, indicating their prominence in the field.
- Word clouds provide a quick visual summary of prevalent topics in research paper titles.
- Streamlit offers a powerful and straightforward way to transform static analyses into interactive web applications, making insights more accessible to a broader audience.
- Managing Python environments with `venv` is crucial for dependency management and avoiding conflicts.