LLM

SBAN Dataset (code, LLM, cybersecurity)

This paper introduces SBAN (Source code, Binary, Assembly, and Natural Language Description), a large-scale, multi-dimensional dataset designed to advance the pre-training and evaluation of large language models (LLMs) for software code analysis. SBAN comprises more than 3 million samples (2.9 million benign and 672,000 malware), each represented across four complementary layers: binary code, assembly instructions, natural language descriptions, and source code.

Categories:

Computational Intelligence

Ex-ToxiCN-MM

Ex-ToxiCN-MM offers opposing interpretations, labeled "harmful" and "non-harmful", for each meme, aiming to rigorously evaluate a model's ability to discern and comprehend ambiguous, culturally grounded content. We also built C-HarmKB, a specialized knowledge base of Chinese cultural concepts and offensive vocabulary, to supply models with essential prior knowledge.

Categories:

Computational Intelligence

Beyond the GPU: Efficiency, Limitations, and Future Trends in FPGA LLM Inference

Large language model (LLM) inference is a rapidly growing class of computing workload, with over 100 GW of compute capacity expected to come online in the next five years. The chips most commonly used for LLM inference are graphics processing units (GPUs), which are expensive and power-intensive. We present a review of the latest literature on LLM inference using field-programmable gate arrays (FPGAs), which are chips that can be programmatically optimized for inference tasks.

Categories:

Artificial Intelligence

Human Responses and LLM-Generated Follow-Up Dialogue for a 54-Item Questionnaire

This dataset contains multistage interview responses (in Russian) collected from 10 human participants who answered 54 primary questions covering life history, personal values, interests, work, hobbies, and future aspirations. Each initial response was evaluated by an LLM-based interviewer agent, which generated a follow-up question whenever the original answer was incomplete or underspecified. Follow-up responses were then produced by a second LLM (GPT-4) simulating the participant’s continuation.

Categories:

Social Sciences

Essays and Evaluations

This data was used in the paper "Harnessing Generative Diversity: An LLM Framework for Consistent Qualitative Assessment."

Categories:

Artificial Intelligence

EarthReason

Remote sensing has become critical for understanding environmental dynamics, urban planning, and disaster management. However, traditional remote sensing workflows often rely on explicit segmentation or detection methods, which struggle to handle complex, implicit queries that require reasoning over spatial context, domain knowledge, and implicit user intent. Motivated by this, we introduce a new task, i.e., geospatial pixel reasoning, which allows implicit querying and reasoning and generates the mask of the target region.

Categories:

Computer Vision

Low-Resource Fine-Tuning for Multi-Task Structured Information Extraction with a 1B Instruction-Tuned Model: Significance Testin

the above is about JSON task

the above is about KGE task

Categories:

Other

Web3-AI Agent Projects

A dataset of Web3-AI Agent projects: `web3-projects.xlsx`

Our comprehensive dataset contains 133 Web3-AI Agent projects collected from three sources: CoinMarketCap, GitHub, and Product Hunt.

Categories:

Artificial Intelligence

HLS-CMDS: Heart and Lung Sounds Dataset Recorded from a Clinical Manikin using Digital Stethoscope

This dataset contains 535 recordings of heart and lung sounds captured from a clinical manikin using a digital stethoscope. It includes both individual and mixed recordings: 50 heart sounds, 50 lung sounds, and 145 mixed sounds. For each mixed sound, the corresponding source heart sound (145 recordings) and source lung sound (145 recordings) were also recorded. The recordings cover different anatomical chest locations and include both normal and abnormal sounds.
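The stated total of 535 recordings can be checked against the per-category counts given in the description. The short sketch below only does that arithmetic; the dictionary keys are hypothetical labels chosen for illustration and are not file or folder names from the dataset.

```python
# Recording counts as stated in the dataset description.
# Key names are illustrative assumptions, not identifiers from the dataset.
counts = {
    "heart_only": 50,           # individual heart sounds
    "lung_only": 50,            # individual lung sounds
    "mixed": 145,               # mixed heart + lung sounds
    "mixed_source_heart": 145,  # source heart sound for each mixed recording
    "mixed_source_lung": 145,   # source lung sound for each mixed recording
}

# Sum all categories to compare with the stated total.
total = sum(counts.values())
print(total)
```

The sum matches the 535 recordings reported, so the three groups of 145 (mixed, source heart, source lung) are counted separately from the 100 individual recordings.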

Categories:

Urban Mobility Research Dataset (Generated with the Quanser Interactive Lab)

Fair Use for Academic Research: If you use this dataset, please cite the following paper to ensure proper attribution:

M. A. Onsu, P. Lohan, B. Kantarci, A. Syed, M. Andrews, S. Kennedy, "Leveraging Multimodal-LLMs Assisted by Instance Segmentation for Intelligent Traffic Monitoring," 30th IEEE Symposium on Computers and Communications (ISCC), July 2025, Bologna, Italy.

Preprint available here: https://arxiv.org/pdf/2502.11304

Categories:
