LLM

This paper introduces SBAN (Source code, Binary, Assembly, and Natural Language Description), a large-scale, multi-dimensional dataset designed to advance the pre-training and evaluation of large language models (LLMs) for software code analysis. SBAN comprises more than 3 million samples (2.9 million benign and 672,000 malware), each represented across four complementary layers: binary code, assembly instructions, natural language descriptions, and source code.

Ex-ToxiCN-MM. This dataset offers opposing interpretations, categorized as "harmful" and "non-harmful", for each meme, aiming to rigorously evaluate a model's ability to discern and comprehend ambiguous, culturally grounded content. We also built C-HarmKB, a specialized knowledge base of Chinese cultural concepts and offensive vocabulary, to supply models with essential prior knowledge.

Large language model (LLM) inference is a rapidly growing class of computer workload, with over 100 GW of compute capacity expected to come online in the next 5 years. The most popular chips used for LLM inference are graphics processing units (GPUs), which are expensive and power-intensive. We present a review of the latest literature on LLM inference using field-programmable gate arrays (FPGAs), which are chips that can be programmatically optimized for inference tasks.

This dataset contains multistage interview responses (in Russian) collected from 10 human participants who answered 54 primary questions covering life history, personal values, interests, work, hobbies, and future aspirations. Each initial response was evaluated by an LLM-based interviewer agent, which generated a follow-up question whenever the original answer was incomplete or underspecified. Follow-up responses were then produced by a second LLM (GPT-4) simulating the participant’s continuation.

Remote sensing has become critical for understanding environmental dynamics, urban planning, and disaster management. However, traditional remote sensing workflows often rely on explicit segmentation or detection methods, which struggle to handle complex queries that require reasoning over spatial context, domain knowledge, and implicit user intent. Motivated by this, we introduce a new task, geospatial pixel reasoning, which supports implicit querying and reasoning and generates the mask of the target region.

This dataset contains 535 recordings of heart and lung sounds captured with a digital stethoscope from a clinical manikin: 50 individual heart sounds, 50 individual lung sounds, and 145 mixed sounds. For each mixed sound, the corresponding source heart sound (145 recordings) and source lung sound (145 recordings) were also recorded. The recordings cover different anatomical chest locations and include both normal and abnormal sounds.

Fair Use for Academic Research: If you use this dataset, please cite the following paper to ensure proper attribution:

M. A. Onsu, P. Lohan, B. Kantarci, A. Syed, M. Andrews, S. Kennedy, "Leveraging Multimodal-LLMs Assisted by Instance Segmentation for Intelligent Traffic Monitoring," 30th IEEE Symposium on Computers and Communications (ISCC), July 2025, Bologna, Italy.

Preprint available here: https://arxiv.org/pdf/2502.11304
