Faculty of Engineering and Technology
Ramaiah University of Applied Sciences
Department: Computer Science and Engineering    Programme: B. Tech
Semester/Batch: 4/2022
Course Code: 20CSC215A    Course Title: Advanced Data Structures
Course Leader: Dr Subarna Chatterjee
Assignment
Register No.:          Group No.: Group 14          Marks:
Sections   Marking Scheme                                            Max Marks   First Examiner Marks   Moderator Marks

Part-A     A1.1  Introduction to data structures                        01
           A1.2  Efficient information retrieval techniques             02
           A1.3  Real-life example of data structure in massive
                 data storage                                           02
           Part-A Max Marks                                             05

Part-B1    B1.1  Introduction to substitution ciphers                   01
           B1.2  Algorithm                                              02
           B1.3  Justification of data structures used in the solution  04
           B1.4  C Program                                              03
           Part-B1 Max Marks                                            10

Part-B2    B2.1  Plagiarism rules and threshold                         02
           B2.2  Pseudocode for checking plagiarized content            03
           B2.3  C Program                                              03
           B2.4  Conclusion and future scope                            02
           Part-B2 Max Marks                                            10

Total Assignment Marks                                                  25
Course Marks Tabulation
Component-CET B: Assignment   First Examiner   Remarks   Second Examiner   Remarks
A
B.1
B.2
Marks (Max 25)

Signature of First Examiner                    Signature of Second Examiner
Please note:
1. Documentary evidence for all the components/parts of the assessment, such as reports,
photographs, and laboratory exam/tool tests, is required to be attached to the assignment report in a
proper order.
2. The First Examiner is required to write comments in RED ink and the Second Examiner’s
comments should be in GREEN ink.
3. The marks for all the questions of the assignment have to be written only in the Component – CET
B: Assignment table.
4. If the variation between the marks awarded by the first examiner and the second examiner lies
within +/- 3 marks, then the marks allotted by the first examiner are considered final. If the
variation is more than +/- 3 marks, then both examiners should resolve the issue in consultation
with the Chairman, BoE.
Instructions to students:
1. The assignment consists of 3 questions: Part A – 1 question, Part B – 2 questions.
2. The maximum marks are 25.
3. The assignment has to be neatly word processed as per the prescribed format.
4. The maximum number of pages should be restricted to 20.
5. Restrict your report for Part-A to 3 pages only.
6. Restrict your report for Part-B to a maximum of 17 pages.
7. The printed assignment must be submitted to the course leader.
8. Submission Date: 12/07/2024
9. Submission after the due date is not permitted.
10. IMPORTANT: It is essential that all the sources used in preparation of the assignment must be
suitably referenced in the text.
11. Marks will be awarded only to the sections and subsections clearly indicated as per the problem
statement/exercise/question
Assignment
Preamble:
This course is aimed at preparing the students to understand and apply the principles of data structures
and algorithms, implement standard data structures and develop algorithms for efficient computer
programs. A broad range of abstract data types as well as algorithms for data storage, access and
manipulation used in program development are taught. Students are trained to develop applications
using appropriate ADTs and algorithms, analyze them and generate an analytical report.
PART – A 5 Marks
A data structure is a systematic way of organizing and accessing data. Data structures are used in
collecting and storing massive collections of data, and they help make data available through operations
such as indexing, searching, and sorting. They assist computers in understanding human-generated
documents and artifacts of all kinds, such as speech, video, text, motion, and biometrics. With this
in mind, write an essay on:
Data structures and information retrieval techniques for natural-language processing
The essay should address the following:
A1.1 Introduction to data structures
A1.2 Efficient information retrieval techniques
A1.3 A real-life example of data structure for massive data storage
PART – B (20 marks)
B.1 10 marks
Encryption is used to keep the data secret. In an encryption process, a file or data transmission is garbled
so that only authorized people with a secret key can unlock the original text. Consider the use of
encryption for the purpose of security in net banking or in a credit card (either by swiping, inserting or
tapping). Design an encryption software that uses substitution cipher techniques to provide
confidentiality and authentication for e-transactions. Your report should include
B1.1 Introduction to substitution ciphers
B1.2 Algorithm
B1.3 Justification of data structures used in the solution
B1.4 C Program
B.2 10 marks
Plagiarism is a serious problem in research ethics. Implement a simple plagiarism detector. Accept a
corpus of existing documents and a potentially plagiarized document. Develop an algorithm that
performs the plagiarism check and determines the copied text and its sources. Your report should
include the following:
B2.1 Plagiarism rules and threshold
B2.2 Pseudocode for checking plagiarized content
B2.3 C Program
B2.4 Conclusion and future scope
Part - A
Data Structures and Information Retrieval Techniques for Natural Language Processing
A1.1 Introduction to Data Structures:
Data structures are essential constructs that allow for the efficient organization, storage, and retrieval of
data. They are pivotal in enabling the development of efficient algorithms for solving complicated
computational problems. At their core, data structures provide the means to manage large volumes of
data effectively, allowing operations such as indexing, searching, and sorting to be performed quickly.
Data structures can be categorized into several classes based on their nature and use cases. These include:
● Linear Data Structures: These structures arrange data in a sequential way. Examples include
arrays, linked lists, stacks, and queues. They are simple and easy to implement but may not
always be efficient for complex data operations.
● Non-Linear Data Structures: These structures arrange data hierarchically. Examples include trees
and graphs. They are suitable for representing relationships among data elements and are
often used in scenarios such as network routing and organizing hierarchical information.
● Hash-Based Data Structures: These structures use hash functions to map data to specific
locations for fast retrieval. Examples include hash tables and hash maps. They are particularly
efficient for lookup operations.
● File-Based Data Structures: These are used for storing data on external storage, such as
files. Examples include B-trees and B+ trees, commonly used in database indexing and file
systems.
Data structures are crucial in various domains, including database management, operating systems,
and artificial intelligence. In the context of natural language processing (NLP), they form the backbone for
handling and manipulating massive textual datasets.
A1.2 Efficient Information Retrieval Techniques
Efficient information retrieval (IR) is crucial for managing the vast amounts of data generated and
processed in NLP. IR techniques aim to fetch relevant information from a large repository in response to user
queries. Several data structures and algorithms are employed to achieve efficient IR:
● Inverted Index: An inverted index is a fundamental data structure used in search engines. It
maps terms (keywords) to their occurrences in a document or a set of documents. This permits
quick retrieval of documents containing specific terms, enabling fast and accurate search
results.
● Suffix Trees and Arrays: These data structures are used for efficient substring search and pattern
matching. A suffix tree is a compressed trie of all suffixes of a given text, permitting
quick searches for patterns and repeated sequences. Suffix arrays offer a more space-efficient
alternative while preserving efficient search capabilities.
● Tries and Radix Trees: Tries are tree-like data structures used to store dynamic sets of strings.
They are especially useful for autocomplete and spell-checking applications. Radix trees, a
variant of tries, offer a more compact representation, optimizing space usage while
preserving fast lookup times.
● k-d Trees and R-trees: These data structures are used for multi-dimensional indexing,
commonly applied in spatial databases and geographic information systems. They enable efficient
nearest-neighbour searches and range queries.
● Bloom Filters: A probabilistic data structure used to test whether an element is a member
of a set. Bloom filters are extremely space-efficient and are often used in applications
where a small number of false positives is acceptable, such as caching and database query
optimization.
In NLP, these IR techniques facilitate tasks such as document retrieval, question answering, and information
extraction, enhancing the ability to process and understand human-generated content.
A1.3 A Real-Life Example of Data Structure for Massive Data Storage
A prominent real-life example of a data structure used for massive data storage is the Google File
System (GFS). GFS is a distributed file system developed by Google to manage large-scale data
processing workloads generated by search and web indexing operations.
GFS is designed to provide high fault tolerance while running on inexpensive commodity hardware.
It employs several key data structures and techniques:
● Chunks and Chunk Servers: Data is divided into fixed-size chunks, each identified
by a unique 64-bit handle. These chunks are stored across multiple chunk servers,
providing redundancy and fault tolerance. Metadata about the chunks is maintained on
a central master server.
● Master Server: The master server maintains metadata, including the namespace, access
control information, and the mapping of chunks to chunk servers. It manages
system-wide activities such as chunk creation, deletion, and replication, ensuring
consistency and reliability.
● Replication: Each chunk is replicated across multiple chunk servers (typically three)
to ensure data availability and fault tolerance. In case of hardware failure, the system can
still retrieve data from the remaining replicas.
● Snapshot and Record Append: GFS supports snapshot and record-append operations, allowing
efficient data backup and concurrent writes. Snapshots allow creating a copy of a file or
directory at a specific point in time, while record append guarantees atomicity for
concurrent write operations.
GFS has proven to be highly effective in handling Google's vast data processing needs,
demonstrating the power of well-designed data structures and distributed systems in managing
massive data storage.
Conclusion:
Data structures and information retrieval techniques are integral to the realm of natural language processing.
They provide the essential framework for organizing, storing, and retrieving massive volumes of data
efficiently. From inverted indexes and suffix trees to distributed file systems like GFS, these
tools enable the development of sophisticated NLP applications capable of processing and understanding
complex human-generated content. As the demand for advanced NLP solutions continues to
grow, the importance of efficient data structures and IR techniques will only become more pronounced.
PART – B
B1.1
Introduction to Substitution Ciphers
Substitution ciphers are one of the simplest forms of encryption, in which every letter in the plaintext is
replaced with another letter or symbol in the ciphertext according to a fixed system. There are two
main kinds of substitution ciphers:
● Monoalphabetic Substitution Cipher: Each letter is replaced consistently throughout the text
with another letter. For instance, 'A' may always be replaced with 'D', 'B' with 'F', and
so on.
● Polyalphabetic Substitution Cipher: The same letter in the plaintext may be replaced
by different letters in the ciphertext, depending on the position of the letter within the text. The most
famous polyalphabetic cipher is the Vigenère cipher.
B1.2
Algorithm:
For the purpose of designing encryption software for e-transactions using substitution ciphers,
we will concentrate on a simple form of monoalphabetic substitution cipher for demonstration.
Algorithm Overview:
● Key Generation:
🡲 Generate a random permutation of the alphabet (the key) to create the substitution mapping.
🡲 This key should be stored securely and shared only with authorized parties.
● Encryption Process:
🡲 Input: Plaintext (message to be encrypted).
🡲 Output: Ciphertext (encrypted message).
🡲 Replace each letter in the plaintext according to the substitution mapping supplied by
the key.
● Decryption Process:
🡲 Input: Ciphertext (encrypted message).
🡲 Output: Plaintext (decrypted message).
🡲 Use the inverse of the substitution mapping (derived from the key) to decrypt the
message.
● Example:
🡲 Suppose our key is: plaintext: ABCDEFGHIJKLMNOPQRSTUVWXYZ
🡲 ciphertext: ZXCVBNMASDFGHJKLQWERTYUIOP
🡲 Using this key, 'A' would be replaced with 'Z', 'B' with 'X', and so on.
Implementation Considerations:
● Security: The security of substitution ciphers relies entirely on the secrecy of the key. If
the key is compromised, the encryption can easily be broken.
● Authentication: To ensure authenticity in e-transactions, additional measures such as digital
signatures or MACs (Message Authentication Codes) must be used alongside encryption.
● Performance: Substitution ciphers are computationally simple, making them suitable for
resource-constrained environments such as embedded systems or mobile devices.
Conclusion
In conclusion, substitution ciphers offer a simple yet effective method of encrypting data for
confidentiality in e-transactions. However, their simplicity means they must be augmented with additional
security measures to ensure authentication and protection against various cryptographic attacks.
This document outlines a foundational approach to designing encryption software using
substitution ciphers, emphasizing their application in securing net banking and credit card transactions.
B1.3
Justification of Data Structures Used in the Solution
● Array: Arrays are used to hold the alphabet for quick lookups and shifting of characters.
This allows constant-time access to characters and their positions.
● String: Plaintext and ciphertext are treated as strings, permitting easy manipulation and
iteration over each character.
● Integer: The key is stored as an integer representing the shift amount, which simplifies the
shifting process during encryption and decryption.
● Character Handling Functions: Functions like isalpha, islower, and isupper ensure that the program
correctly identifies and handles alphabetic characters while ignoring
non-alphabetic characters.
B1.4 C Program
OUTPUT:
B.2
B2.1 Plagiarism rules and threshold:
Plagiarism detection involves identifying instances where substantial portions of text in a document
(the suspect document) match existing documents (the corpus). Determining plagiarism requires establishing clear
rules and thresholds. Here's a breakdown:
Rules:
Matching Text Length: Define a minimum length for matching text segments. For example, consider it
a match if five or more consecutive words are identical in both files.
Similarity Threshold: Set a percentage threshold for similarity between matching segments. For
instance, 80% similarity or above may be considered plagiarism.
Exclusion of Common Phrases: Exclude common phrases like "the quick brown fox" or stop words
(articles, prepositions) from similarity calculations. This reduces false positives.
Threshold:
The plagiarism threshold is the minimum degree of similarity or matching text length that triggers a
plagiarism flag. Setting the right threshold is crucial:
High Threshold: Might miss significant plagiarism instances with moderate modifications.
Low Threshold: Could flag common phrases or idioms as plagiarism, leading to inaccurate results.
Finding the Right Balance:
The best threshold depends on the context. For academic papers, a stricter threshold (e.g., 80%
similarity) might be needed. For creative writing with common phrasings, a less strict threshold
(e.g., 70%) may be suitable.
B2.2
Algorithm:
1. Define constants for the maximum number of documents and the maximum length of each document.
2. Define a function `calculate_similarity` to calculate the similarity between two strings.
3. Declare arrays and variables to store the documents and the plagiarized text.
4. Read the number of documents from the user.
5. Ensure the number of documents does not exceed the maximum limit.
6. Read each document from the user.
7. Read the plagiarized document from the user.
8. Check each document in the corpus for potential plagiarism and print the results if any similarity is
found.
Pseudocode:
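The pseudocode figure in the source did not survive text extraction; following the eight numbered steps above, a reconstruction might read:

```text
CONSTANTS MAX_DOCS, MAX_LEN, THRESHOLD

FUNCTION calculate_similarity(a, b):
    matched ← 0
    FOR each word w IN b:
        IF w occurs IN a THEN matched ← matched + 1
    RETURN 100 × matched / word_count(b)

READ n                          // number of corpus documents
IF n > MAX_DOCS THEN n ← MAX_DOCS
FOR i ← 1 TO n:
    READ docs[i]
READ suspect                    // potentially plagiarized document
FOR i ← 1 TO n:
    sim ← calculate_similarity(docs[i], suspect)
    IF sim ≥ THRESHOLD THEN
        PRINT "Possible plagiarism from document", i, "similarity", sim
```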
B2.3 C Program
OUTPUT:
B2.4 Conclusion and future scope
This document presented a simple plagiarism detection algorithm using sliding windows and
similarity thresholds. It highlights the importance of defining clear rules, including minimum
matching text length and similarity thresholds, to effectively detect potential plagiarism. However,
this approach has limitations:
● Limited Accuracy: Simple similarity calculations can miss paraphrased plagiarism or plagiarism
with minor word changes.
● False Positives: Common phrases or idioms may cause false alarms.
● Context Dependence: The optimal threshold depends on the context (academic vs.
creative writing).
The future of plagiarism detection lies in exploring more advanced techniques:
● N-grams and Shingles: Analyze sequences of words (n-grams) or overlapping subsequences
(shingles) to identify matches despite rearrangements.
● Fingerprinting: Create compact document representations (fingerprints) based on word frequencies
or statistical properties, enabling efficient plagiarism detection.
● Machine Learning: Train machine learning models on large datasets of plagiarized and
non-plagiarized text to learn nuanced patterns and improve accuracy.
● Citation Network Analysis: Analyze citation patterns and relationships among documents to
identify potential plagiarism sources.
*****