Data Analysis vs. Data Mining Explained
Data analysis involves the process of cleaning, organizing, and using data to produce meaningful insights. Data mining is
used to search for hidden patterns in the data.
Data analysis produces results that are far more comprehensible to a variety of audiences than the results of data
mining.
Data validation, as the name suggests, is the process that involves determining the accuracy of data and the quality of the
source as well. There are many processes in data validation but the main ones are data screening and data verification.
• Data screening: Making use of a variety of models to ensure that the data is accurate and no redundancies are
present.
• Data verification: If a redundancy or error is found, it is evaluated through multiple steps, and a decision is then
made on whether the data item should be kept, corrected, or removed.
Data analysis is the structured procedure that involves working with data by performing activities such as ingestion,
cleaning, transforming, and assessing it to provide insights, which can be used to drive revenue.
Data is collected, to begin with, from varied sources. Since the data is a raw entity, it has to be cleaned and processed to fill
out missing values and to remove any entity that is out of the scope of usage.
After pre-processing the data, it can be analyzed with the help of models, which use the data to perform some analysis on it.
The last step involves reporting and ensuring that the data output is converted to a format that can also cater to a non-
technical audience, alongside the analysts.
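The collect, clean, analyze, and report flow above can be sketched in pure Python. The records and field names below are hypothetical; in practice a library such as pandas would handle this.

```python
# Hypothetical raw records: one missing value, one exact duplicate.
raw = [
    {"city": "Agra", "sales": "120"},
    {"city": "Agra", "sales": ""},       # missing value
    {"city": "Delhi", "sales": "95"},
    {"city": "Delhi", "sales": "95"},    # duplicate entry
]

# Clean: drop exact duplicates and rows with missing sales.
seen, clean = set(), []
for row in raw:
    key = tuple(sorted(row.items()))
    if row["sales"] and key not in seen:
        seen.add(key)
        clean.append({"city": row["city"], "sales": int(row["sales"])})

# Analyze: aggregate sales per city.
totals = {}
for row in clean:
    totals[row["city"]] = totals.get(row["city"], 0) + row["sales"]

# Report: a plain-language summary for a non-technical audience.
for city, total in sorted(totals.items()):
    print(f"{city}: total sales {total}")
```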
This question is subjective, but certain simple assessment points can be used to assess the accuracy of a data model. They
are as follows:
• A well-designed model should offer good predictability, i.e., the ability to easily predict future insights when
needed.
• A rounded model adapts easily to any change made to the data or the pipeline if need be.
• The model should be able to cope if there is an immediate requirement to scale to much larger data.
• The model should be easy to work with and easily understood by clients, helping them derive the required
results.
Data cleaning is also called data wrangling. As the name suggests, it is a structured way of finding erroneous content in data
and safely removing it to ensure that the data is of the utmost quality.
There can be many issues that a Data Analyst might face when working with data. Here are some of them:
• The accuracy of the model in development will be low if there are multiple entries of the same entity, spelling
errors, or otherwise incorrect values.
• If the source from which the data is ingested is not a verified source, then the data might require a lot of cleaning
and preprocessing before the analysis can begin.
• The same goes for when extracting data from multiple sources and merging them for use.
• The analysis will be set back if the data obtained is incomplete or inaccurate.
Data profiling is a methodology that involves analyzing all entities present in data to a greater depth. The goal here is to
provide highly accurate information based on the data and its attributes such as the datatype, frequency of occurrence, and
more.
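As a rough illustration, a minimal profile of a single column can be computed with the standard library. The column values here are made up for the example.

```python
from collections import Counter

# Hypothetical column of raw values to profile.
column = ["red", "blue", "red", None, "green", "red", None]

profile = {
    "count": len(column),
    "nulls": sum(v is None for v in column),
    "distinct": len({v for v in column if v is not None}),
    # Datatype of the first non-null value.
    "dtype": type(next(v for v in column if v is not None)).__name__,
    # Frequency of occurrence of the most common values.
    "top_frequencies": Counter(v for v in column if v is not None).most_common(2),
}
print(profile)
```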
Data is never stagnant. If a business expands, sudden opportunities may open up that call for a change in the data.
Furthermore, periodically assessing the model against current data can help the analyst decide whether it needs to be
re-trained or not.
However, the general rule of thumb is to ensure that the models are re-trained when there is a change in the business
protocols and offerings.
There are many skills that a budding Data Analyst needs. Here are some of them:
• Being well-versed in data-related technologies such as XML, JavaScript, and ETL frameworks
10: What are the top tools used to perform Data Analysis?
There is a wide spectrum of tools that can be used in the field of data analysis. Here are some of the popular ones:
• RapidMiner
• Tableau
• KNIME
• OpenRefine
An outlier is a value in a dataset that lies far away from the mean, or from the overall pattern, of the dataset's characteristic feature.
There are two types of outliers: univariate and multivariate.
12: How can we deal with problems that arise when the data flows in from a variety of sources?
There are many ways to go about dealing with multi-source problems. However, these are done primarily to solve the
problems of:
• Identifying the presence of similar/same records and merging them into a single record
13: What are some of the popular tools used in Big Data?
There are multiple tools that are used to handle Big Data. Some of the most popular ones are as follows:
• Hadoop
• Spark
• Scala
• Hive
• Flume
• Mahout
Pivot tables are one of the key features of Excel. They allow a user to view and summarize the entirety of large datasets
simply. Most of the operations with Pivot tables involve drag-and-drop operations that aid in the quick creation of reports.
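Pivot tables themselves are an Excel feature, but the underlying summarize-by-group idea can be sketched in Python. The sales records below are hypothetical.

```python
from collections import defaultdict

# Hypothetical sales records; a pivot summarizes total sales per (rep, item).
sales = [
    ("Alice", "Pen", 30), ("Alice", "Book", 50),
    ("Bob", "Pen", 20), ("Alice", "Pen", 10),
]

# Equivalent of dragging Rep and Item to Rows and summing the amounts.
pivot = defaultdict(int)
for rep, item, amount in sales:
    pivot[(rep, item)] += amount

for (rep, item), total in sorted(pivot.items()):
    print(rep, item, total)
```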
KNN is a method that requires choosing both the number of nearest neighbors (k) and a distance metric. It can
predict both discrete and continuous attributes of a dataset.
A distance function is used here to find the similarity of two or more attributes, which will help in further analysis.
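A minimal KNN classifier along these lines can be sketched in pure Python. The training points and labels are invented for illustration; in practice scikit-learn's KNeighborsClassifier would be used.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among the k nearest points.
    `train` is a list of ((x, y), label) pairs; distance is Euclidean."""
    dists = sorted((math.dist(point, query), label) for point, label in train)
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-D points belonging to two classes.
train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((6, 5), "b"), ((5, 6), "b")]
print(knn_predict(train, (0.5, 0.5)))  # prints 'a'
print(knn_predict(train, (5.5, 5.5)))  # prints 'b'
```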
16: What are the top Apache frameworks used in a distributed computing environment?
MapReduce and Hadoop are considered to be the top Apache frameworks when the situation calls for working with a huge
dataset in a distributed working environment.
Hierarchical clustering, or hierarchical cluster analysis, is an algorithm that groups similar objects into common groups called
clusters. The goal is to create a set of clusters, where each cluster is different from the other and, individually, they contain
similar entities.
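A toy single-linkage sketch for 1-D points follows: repeatedly merge the two closest clusters until k remain. The points are invented; real work would use scipy or scikit-learn.

```python
def single_linkage(points, k):
    """Agglomerative single-linkage clustering for 1-D points:
    start with singletons and merge the pair with the smallest gap."""
    clusters = [[p] for p in sorted(points)]
    while len(clusters) > k:
        # For sorted 1-D data the closest pair is always adjacent.
        gaps = [(clusters[i + 1][0] - clusters[i][-1], i)
                for i in range(len(clusters) - 1)]
        _, i = min(gaps)
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return clusters

# Hypothetical values forming three natural groups.
print(single_linkage([1.0, 1.1, 5.0, 5.2, 9.9], k=3))
```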
18: What are the steps involved when working with a Data Analysis project?
Many steps are involved when working end-to-end on a data analysis project. Some of the important steps are as
mentioned below:
• Problem statement
• Data cleaning/preprocessing
• Data exploration
• Modeling
• Data validation
• Implementation
• Verification
19: Can you name some of the statistical methodologies used by Data Analysts?
There are many statistical techniques that are very useful when performing data analysis. Here are some of the important
ones:
• Markov process
• Cluster analysis
• Imputation techniques
• Bayesian methodologies
• Rank statistics
Time series analysis, or TSA, is a widely used statistical technique when working with trend analysis and time-series data in
particular. The time-series data involves the presence of the data at particular intervals of time or set periods.
Since time series analysis (TSA) has a wide scope of usage, it can be used in multiple domains. Here are some of the places
where TSA plays an important role:
• Statistics
• Signal processing
• Econometrics
• Weather forecasting
• Earthquake prediction
• Astronomy
• Applied science
Any clustering algorithm, when implemented will have the following properties:
• Flat or hierarchical
• Iterative
• Disjunctive
Collaborative filtering is an algorithm used to create recommendation systems mainly considering the behavioral data of a
customer or a user.
For example, when browsing through e-commerce sites, a section called ‘Recommended for you’ is present. This is done
using the browsing history, alongside analyzing the previous purchases and collaborative filtering.
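A minimal user-based collaborative-filtering sketch is shown below. The users, items, and ratings are invented for illustration.

```python
import math

# Hypothetical user -> {item: rating} behavioral data.
ratings = {
    "u1": {"book": 5, "pen": 3, "lamp": 4},
    "u2": {"book": 4, "pen": 3},
    "u3": {"lamp": 1, "desk": 5},
}

def cosine(a, b):
    """Cosine similarity between two users' rating vectors."""
    common = set(a) & set(b)
    num = sum(a[i] * b[i] for i in common)
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def recommend(user):
    """Recommend items the most similar other user rated but `user` hasn't."""
    others = [(cosine(ratings[user], ratings[o]), o)
              for o in ratings if o != user]
    _, nearest = max(others)
    return sorted(set(ratings[nearest]) - set(ratings[user]))

print(recommend("u2"))  # prints ['lamp']
```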
There are many types of hypothesis testing. Some of them are as follows:
• Analysis of variance (ANOVA): Here, the analysis is conducted between the mean values of multiple groups.
• T-test: This form of testing is used when the standard deviation is not known and the sample size is relatively small.
• Chi-square test: This kind of hypothesis testing is used when there is a requirement to find out the level of
association between the categorical variables in a sample.
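As a rough illustration of a two-sample test, Welch's t-statistic can be computed by hand. The two groups are made-up samples; in practice scipy.stats.ttest_ind would be used.

```python
import math
import statistics

# Hypothetical scores for two small groups.
group_a = [12.1, 11.8, 12.4, 12.0, 11.9]
group_b = [12.8, 13.1, 12.9, 13.3, 12.7]

# Welch's t-statistic: difference in means over its standard error.
ma, mb = statistics.mean(group_a), statistics.mean(group_b)
va, vb = statistics.variance(group_a), statistics.variance(group_b)
t = (ma - mb) / math.sqrt(va / len(group_a) + vb / len(group_b))

# A large |t| is evidence against the null hypothesis of equal means.
print(round(t, 2))
```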
25: What are some of the data validation methodologies used in Data Analysis?
Many types of data validation techniques are used today. Some of them are:
• Field-level validation: Validation is done across each of the fields to ensure that there are no errors in the data
entered by the user.
• Form-level validation: Here, validation is done when the user completes working with the form but before the
information is saved.
• Data saving validation: This form of validation takes place when the file or the database record is being saved.
• Search criteria validation: This kind of validation is used to check whether valid results are returned when the user
is looking for something.
K-means algorithm clusters data into different sets based on how close the data points are to each other. The number of
clusters is indicated by ‘k’ in the k-means algorithm. It tries to maintain a good amount of separation between each of the
clusters.
However, since it works in an unsupervised nature, the clusters will not have any sort of labels to work with.
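A toy 1-D k-means sketch showing the assign-then-update loop is given below; the points and starting centers are invented.

```python
import statistics

def kmeans_1d(points, centers, iters=10):
    """Tiny 1-D k-means: assign each point to the nearest center,
    then move each center to the mean of its assigned points."""
    for _ in range(iters):
        clusters = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        centers = [statistics.mean(v) for v in clusters.values() if v]
    return sorted(centers)

# Two well-separated hypothetical groups; k = 2.
points = [1.0, 1.2, 0.8, 10.0, 10.4, 9.6]
print(kmeans_1d(points, centers=[0.0, 5.0]))  # two centers, near 1.0 and 10.0
```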
27: What is the difference between the concepts of recall and the true positive rate?
Recall and the true positive rate are identical; both measure the fraction of actual positives that the model correctly identifies. Here's the formula:
Recall = TPR = TP / (TP + FN)
28: What are the ideal situations in which t-test or z-test can be used?
It is standard practice to use a t-test when the sample size is less than 30, and a z-test when the sample size
exceeds 30, in most cases.
It is called naive because it makes the simplifying assumption that all features are equally important and
independent of each other. This rarely holds in a real-world scenario.
30: What is the simple difference between standardized and unstandardized coefficients?
Standardized coefficients are interpreted in terms of standard deviation units, while unstandardized coefficients are
measured in the actual units of the values present in the dataset.
Multiple methodologies can be used for detecting outliers, but the two most commonly used methods are as follows:
• Standard deviation method: Here, the value is considered as an outlier if the value is lower or higher than three
standard deviations from the mean value.
• Box plot method: Here, a value is considered an outlier if it lies more than 1.5 times the interquartile range (IQR)
below the first quartile or above the third quartile.
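Both detection rules can be sketched with the standard library. The sample below is invented, with one planted outlier.

```python
import statistics

# Hypothetical sample with one obvious outlier (60).
data = [10, 11, 12, 10, 11, 12, 13, 11, 10, 12, 11, 13, 12, 11, 60]

# Standard deviation method: flag values more than 3 std devs from the mean.
mean, sd = statistics.mean(data), statistics.stdev(data)
sd_outliers = [x for x in data if abs(x - mean) > 3 * sd]

# IQR method: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
iqr_outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

print(sd_outliers, iqr_outliers)  # both methods flag 60
```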
K-Nearest Neighbour (KNN) is preferred here because of the fact that KNN can easily approximate the value to be
determined based on the values closest to it.
33: How can one handle suspicious or missing data in a dataset while performing analysis?
If there are any discrepancies in data, a user can go on to use any of the following methods:
• Escalating the same to an experienced Data Analyst to look at it and take a call
• Replacing the invalid data with a corresponding valid and up-to-date data
• Using many strategies together to find missing values and using approximation if needed
34: What is the simple difference between Principal Component Analysis (PCA) and Factor Analysis (FA)?
Among many differences, the major one between PCA and FA lies in the fact that factor analysis models the shared
covariance between variables using latent factors, while the aim of PCA is to explain as much of the total variance as
possible with orthogonal components.
Next up on this top Data Analyst interview questions and answers, let us check out some of the top questions from the
advanced category.
• Establishes an easy way to compare files, identify differences, and merge if any changes are done
• Creates an easy way to track the life cycle of an application build, including every stage in it such as development,
production, testing, etc.
• Ensures that every version and variant of code is kept safe and secure
Next up on these interview questions for Data Analysts, we have to take a look at the trends regarding this domain.
With this question, the interviewer is trying to assess your grip on the subject and your research in the field. Make sure to
state valid facts, backed by credible sources, to strengthen your candidature. Also, try to explain how Artificial
Intelligence is making a huge impact on data analysis and its potential in the field.
37: Why are you applying for the Data Analyst role in our company?
Here, the interviewer is trying to see how well you can convince them regarding your proficiency in the subject, alongside
the need for data analysis at the firm you’ve applied for. It is always an added advantage to know the job description in
detail, along with the compensation and the details of the company.
38: Can you rate yourself on a scale of 1–10 depending on your proficiency in Data Analysis?
With this question, the interviewer is trying to grasp your understanding of the subject, your confidence, and your
spontaneity. The most important thing to note here is that you answer honestly based on your capacity.
39: Has your college degree helped you with Data Analysis in any way?
This is a question that relates to the latest program you completed in college. Do talk about the degree you have obtained,
how it was useful, and how you plan on putting it to full use in the coming days after being recruited in the company.
40: What is your plan after taking up this Data Analyst role?
While answering this question, keep your explanation concise: describe how you would bring about a plan that works
with the company set-up, and how you would implement and validate it. Do highlight how it can be made better in the
coming days with further iteration.
Compared to the plethora of advantages, there are very few disadvantages when considering Data Analytics. Some of the
disadvantages are listed below:
• Data Analytics can cause a breach in customer privacy and their information such as transactions, purchases, and
subscriptions.
• It takes a lot of skills and expertise to select the right analytics tool every time.
This is a descriptive question that is highly dependent on how analytical your thinking skills are. There are a variety of tools
that a Data Analyst must have expertise in. Programming languages such as Python, R, and SAS, probability, statistics,
regression, correlation, and more are the primary skills that a Data Analyst should possess.
43: Why do you think you are the right fit for this Data Analyst role?
With this question, the interviewer is trying to gauge your understanding of the job description and where you’re coming
from, with respect to your knowledge of Data Analysis. Be sure to answer this in a concise yet detailed manner by explaining
your interests, goals, and visions and how these match with the company’s substructure.
44: Can you please talk about your past Data Analysis work?
This is a very commonly asked question in a data analysis interview. The interviewer will be assessing you for your clarity in
communication, actionable insights from your work experience, your debating skills if questioned on the topics, and how
thoughtful you are in your analytical skills.
45: Can you please explain how you would estimate the number of visitors to the Taj Mahal in November 2019?
This is a classic guesstimate (estimation) question, asked to check your thought process without the use of computers or
any sort of dataset. You can begin your answer using the below template:
‘First, I would gather some data. To start with, I’d like to find out the population of Agra, where the Taj Mahal is located. The
next thing I would take a look at is the number of tourists that came to visit the site during that time. This is followed by the
average length of their stay that can be further analyzed by considering factors such as age, gender, and income, and the
number of vacation days and bank holidays there are in India. I would also go about analyzing any sort of data available from
the local tourist offices.’
46: Do you have any experience working in the same industry as ours before?
This is a very straightforward question. This aims to assess if you have the industry-specific skills that are needed for the
current role. Even if you do not possess all of the skills, make sure to thoroughly explain how you can still make use of the
skills you’ve obtained in the past to benefit the company.
47: Have you earned any sort of certifications to boost your opportunities as a Data Analyst aspirant?
As always, interviewers look for candidates who are serious about advancing their career options by making use of
additional tools like certifications. Certificates are strong proof that you have put in all the efforts to learn new skills, master
them, and put them to use to the best of your capacity. List the certifications, if you have any, and do talk about them in
brief, explaining what all you learned from the program and how it’s been helpful to you so far.
48: What tools do you prefer to use in the various phases of Data Analysis?
This again is a question to check what tools you think are useful for their respective tasks. Do talk about how comfortable
you are with the tools you mention and about their popularity in the market today.
49: Which step of a Data Analysis project do you like the most?
Do know that it is completely normal to have a predilection toward certain tools and tasks over others. However, while
performing data analysis, you will always be expected to deal with the entirety of the analytics life cycle, so make sure not to
speak negatively about any of the tools or of the steps in the process of data analysis.
Finally, in this Data Analyst interview questions blog, let us understand how to carefully approach this
question and answer it to the best of our ability.
50: How good are you in terms of explaining technical content to a non-technical audience with respect to Data Analysis?
This is another classic question asked in most of the Data Analytics interviews. Here, it is extremely vital that you talk about
your communication skills in terms of delivering the technical content, your level of patience, and your ability to break
content into smaller chunks to help the audience understand better.
It is always advantageous to show the interviewer that you are very well capable of working effectively with people from a
variety of backgrounds who may or may not be technical.
Understanding the Problem
Understand the business problem, define the organizational goals, and plan for a lucrative solution.
Collecting Data
Gather the right data from various sources and other information based on your priorities.
Cleaning Data
Clean the data to remove unwanted, redundant, and missing values, and make it ready for analysis.
Analyzing Data
Use data visualization and business intelligence tools, data mining techniques, and predictive modeling to analyze data.
Interpreting the Results
Interpret the results to find out hidden patterns and future trends, and gain insights.
4. What are the common problems that data analysts encounter during analysis?
The common problems encountered in any analytics project are:
• Handling duplicate entries
• Before working with the data, identify and remove the duplicates. This will lead to an easy and effective data analysis
process.
• Focus on the accuracy of the data. Set cross-field validation, maintain the value types of data, and provide mandatory
constraints.
• Normalize the data at the entry point so that it is less chaotic. You will be able to ensure that all information is
standardized, leading to fewer errors on entry.
• It helps you obtain confidence in your data to a point where you’re ready to engage a machine learning algorithm.
• It allows you to refine your selection of feature variables that will be used later for model building.
• You can discover hidden trends and insights from the data.
8. Explain descriptive, predictive, and prescriptive analytics.
• Descriptive analytics: uses data aggregation and data mining techniques.
• Predictive analytics: uses statistical models and forecasting techniques.
• Prescriptive analytics: uses simulation algorithms and optimization techniques to advise possible outcomes.
Sampling is a statistical method to select a subset of data from an entire dataset (population) to estimate the characteristics
of the whole population.
• Systematic sampling
• Cluster sampling
• Stratified sampling
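A minimal stratified-sampling sketch is shown below; the population and its segment field are hypothetical.

```python
import random
from collections import defaultdict

random.seed(0)  # reproducible example

# Hypothetical population with a 'segment' attribute to stratify on.
population = [{"id": i, "segment": "gold" if i % 5 == 0 else "basic"}
              for i in range(100)]

def stratified_sample(rows, key, frac):
    """Sample the same fraction independently from each stratum."""
    strata = defaultdict(list)
    for r in rows:
        strata[r[key]].append(r)
    sample = []
    for group in strata.values():
        sample.extend(random.sample(group, max(1, int(len(group) * frac))))
    return sample

sample = stratified_sample(population, "segment", 0.1)
print(len(sample))  # 10% of each stratum
```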
Univariate analysis can be described using Central Tendency, Dispersion, Quartiles, Bar charts, Histograms, Pie charts, and
Frequency distribution tables.
The bivariate analysis involves the analysis of two variables to find causes, relationships, and correlations between the
variables.
Example – Analyzing the sale of ice creams based on the temperature outside.
The bivariate analysis can be explained using Correlation coefficients, Linear regression, Logistic regression, Scatter plots,
and Box plots.
The multivariate analysis involves the analysis of three or more variables to understand the relationship of each variable
with the other variables.
Multivariate analysis can be performed using Multiple regression, Factor analysis, Classification & regression trees, Cluster
analysis, Principal component analysis, Dual-axis charts, etc
Listwise Deletion
In the listwise deletion method, an entire record is excluded from analysis if any single value is missing.
Average Imputation
Take the average value of the other participants' responses and fill in the missing value.
Regression Substitution
It creates plausible values based on the correlations for the missing data and then averages the simulated datasets by
incorporating random errors in your predictions.
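The first two methods can be sketched in a few lines; the ages below are made-up survey values.

```python
import statistics

# Hypothetical survey responses; None marks a missing value.
ages = [25, 30, None, 40, None, 35]

# Listwise deletion: drop every record with a missing value.
listwise = [a for a in ages if a is not None]

# Average imputation: fill missing values with the mean of the rest.
mean_age = statistics.mean(listwise)
imputed = [a if a is not None else mean_age for a in ages]

print(listwise)  # [25, 30, 40, 35]
print(imputed)   # [25, 30, 32.5, 40, 32.5, 35]
```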
Normal Distribution refers to a continuous probability distribution that is symmetric about the mean. In a graph, normal
distribution will appear as a bell curve.
• 68% of the data falls within one standard deviation of the mean
• 95% of the data lies between two standard deviations of the mean
• 99.7% of the data lies between three standard deviations of the mean
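The 68-95-99.7 figures can be checked with the standard library's NormalDist:

```python
from statistics import NormalDist

# Standard normal distribution; share of data within k standard deviations.
nd = NormalDist(mu=0, sigma=1)
for k in (1, 2, 3):
    share = nd.cdf(k) - nd.cdf(-k)
    print(f"within {k} sd: {share:.4f}")
```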
Time Series analysis is a statistical procedure that deals with the ordered sequence of values of a variable at equally spaced
time intervals. Time series data are collected at adjacent periods. So, there is a correlation between the observations. This
feature distinguishes time-series data from cross-sectional data.
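A rough way to see the correlation between observations is the lag-k autocorrelation. The series below is an invented, perfectly periodic example.

```python
import statistics

# Hypothetical equally spaced series with a repeating period-3 pattern.
series = [3, 5, 7, 3, 5, 7, 3, 5, 7, 3, 5, 7]

def autocorr(x, lag):
    """Lag-k autocorrelation: correlation of the series with itself shifted by k."""
    m = statistics.mean(x)
    num = sum((x[i] - m) * (x[i + lag] - m) for i in range(len(x) - lag))
    den = sum((v - m) ** 2 for v in x)
    return num / den

print(autocorr(series, 3))  # strong correlation at the seasonal lag
print(autocorr(series, 1))  # weaker, negative correlation at lag 1
```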
This is another frequently asked data analyst interview question, and you are expected to cover all the given differences!
• Overfitting: performance drops considerably over the test set.
• Underfitting: the model performs poorly on both the train and the test set.
The graph depicted below shows there are three outliers in the dataset.
To deal with outliers, you can use the following four methods: drop the outlier records, cap the values at a threshold, assign a new (imputed) value, or try a transformation such as a log transform.
Hypothesis testing is the procedure used by statisticians and scientists to accept or reject statistical hypotheses. There are
mainly two types of hypothesis testing:
• Null hypothesis: It states that there is no relation between the predictor and outcome variables in the population. It
is denoted by H0.
• Alternative hypothesis: It states that there is some relation between the predictor and outcome variables in the
population. It is denoted by H1.
In Hypothesis testing, a Type I error occurs when the null hypothesis is rejected even if it is true. It is also known as a false
positive.
A Type II error occurs when the null hypothesis is not rejected, even if it is false. It is also known as a false negative.
Excel Data Analyst Interview Questions
19. In Microsoft Excel, a numeric value can be treated as a text value if it precedes with what?
A numeric value is treated as text when it is preceded with an apostrophe (').
20. What is the difference between COUNT, COUNTA, COUNTBLANK, and COUNTIF in Excel?
• COUNT counts cells that contain numeric values.
• COUNTA counts all cells that are not empty.
• COUNTBLANK counts empty cells.
• COUNTIF counts cells that satisfy a given condition.
22. Can you provide a dynamic range in “Data Source” for a Pivot table?
Yes, you can provide a dynamic range in the “Data Source” of Pivot tables. To do that, you need to create a named range
using the OFFSET function and base the pivot table on the named range constructed in the first step.
17. What is the function to find the day of the week for a particular date value?
To get the day of the week, you can use the WEEKDAY() function.
The above function will return 6 as the result, i.e., 17th December is a Saturday.
18. How does the AND() function work in Excel?
AND() is a logical function that checks multiple conditions and returns TRUE or FALSE based on whether the conditions are
met.
Syntax: AND(logical1, [logical2], [logical3], ...)
In the below example, we are checking if the marks are greater than 45. The result will be TRUE if the marks are
greater than 45, and FALSE otherwise.
VLOOKUP is used when you need to find things in a table or a range by row.
Syntax: VLOOKUP(lookup_value, table, col_index, [range_lookup])
• lookup_value - the value to look for in the first column of a table
• table - the table from which to extract the value
• col_index - the column in the table from which to extract the value
• range_lookup - [optional] TRUE = approximate match (default); FALSE = exact match
Let's understand VLOOKUP with an example.
If you wanted to find the department to which Stuart belongs, you could use the VLOOKUP function as shown below:
Here, A11 cell has the lookup value, A2:E7 is the table array, 3 is the column index number with information about
departments, and 0 is the range lookup.
If you hit enter, it will return “Marketing”, indicating that Stuart is from the marketing department
17. What function would you use to get the current date and time in Excel?
In Excel, you can use the TODAY() and NOW() function to get the current date and time.
18. Using the below sales table, calculate the total quantity sold by sales representatives whose name starts with A,
and the cost of each item they have sold is greater than 10.
You can use the SUMIFS() function to find the total quantity.
For the Sales Rep column, you need to give the criteria as “A*” - meaning the name should start with the letter “A”. For the
Cost each column, the criteria should be “>10” - meaning the cost of each item is greater than 10.
• Select the entire table range, click on the Insert tab and choose PivotTable
• Select the table range and the worksheet where you want to place the pivot table
• Drag Sale total on to Values, and Sales Rep and Item on to Row Labels. It will give the sum of sales made by each
representative for every item they have sold.
• Right-click on “Sum of Sale Total’ and expand Show Values As to select % of Grand Total.
1) What is Python?
Python is an interpreted, object-oriented, high-level programming language that runs on many platforms, including
Windows, Linux, UNIX, and Macintosh. Its high-level built-in data structures, combined with dynamic typing and dynamic
binding, make it attractive for rapid application development. It is widely used in the data science, machine learning,
and artificial intelligence domains.
It is easy to learn and requires less code to develop applications. It is widely used for:
o Software development.
o Mathematics.
o System scripting.
2) Why Python?
o Python is compatible with different platforms like Windows, Mac, Linux, Raspberry Pi, etc.
o Python allows a developer to write programs with fewer lines than some other programming languages.
o Python runs on an interpreter system, meaning that code can be executed as soon as it is written. This helps to
provide a prototype very quickly.
o The Python interpreter and the extensive standard library are available in source or binary form without charge for
all major platforms, and can be freely distributed.
Python is used in various software domains; some application areas are given below.
o Games
o Language development
o Operating systems
Python provides various web frameworks to develop web applications. The popular python web frameworks are Django,
Pyramid, Flask.
Python's standard library supports e-mail processing, FTP, IMAP, and other Internet protocols. Python's SciPy and NumPy
help in scientific and computational application development.
Interpreted: Python is an interpreted language. It does not require prior compilation of code and executes instructions
directly.
Free and open source: It is an open-source project which is publicly available to reuse. It can be downloaded free of cost.
o It is Extensible
o Object-oriented
Object-oriented: Python allows to implement the Object-Oriented concepts to build application solution.
Built-in data structure: Tuple, List, and Dictionary are useful integrated data structures provided by the language.
o Readability
o High-Level Language
o Cross-platform
Portable: Python programs can run on cross platforms without affecting its performance.
5) What is PEP 8?
PEP 8 stands for Python Enhancement Proposal 8. It is a document that provides guidelines on how to write Python code:
basically a set of rules that specify how to format Python code for maximum readability.
It was written by Guido van Rossum, Barry Warsaw, and Nick Coghlan in 2001.
Literals can be defined as data given in a variable or constant. Python supports the following literals:
String Literals
String literals are formed by enclosing text in single or double quotes.
Example:
# in single quotes
single = 'AdnanManna'
# in double quotes
double = "AdnanManna"
# multi-line string
multi = '''Adnan
Manna'''

print(single)
print(double)
Numeric Literals
Python supports three types of numeric literals integer, float and complex.
Example:
# Integer literal
a = 10
# Float literal
b = 10.5
# Complex literal
x = 3 + 4j

print(a)
print(b)
print(x)
Output:
10
10.5
(3+4j)
Boolean Literals
Boolean literals are used to denote Boolean values. It contains either True or False.
Example:
p = (1 == True)
q = (1 == False)
r = True + 3
s = False + 7

print("p is", p)
print("q is", q)
print("r:", r)
print("s:", s)
Output:
p is True
q is False
r: 4
s: 7
Special literals
Python contains one special literal, that is, 'None'. This special literal is used for defining a null variable. If 'None' is
compared with anything other than another 'None', it will return False.
Example:
word = None
print(word)
Output:
None
A function is a section of the program or a block of code that is written once and can be executed whenever required in the
program. A function is a block of self-contained statements which has a valid name, parameters list, and body. Functions
make programming more functional and modular to perform modular tasks. Python provides several built-in functions to
complete tasks and also allows a user to create new functions as well.
o Built-in functions: len(), sum(), and print() are some of the built-in functions.
o User-defined functions: Functions that are defined by a user are known as user-defined functions.
o Anonymous functions: These functions are also known as lambda functions because they are not declared with the
standard def keyword.
def function_name(parameters):
    # --- statements ---
    return a_value
Python's zip() function returns a zip object, which maps the same index of multiple containers. It takes iterables,
converts them into iterators, and aggregates elements based on the iterables passed, returning an iterator of tuples.
Signature: zip(iterator1, iterator2, ...)
Parameters
iterator1, iterator2, iterator3: These are the iterator objects that are joined together.
Return
It returns a zip object, which is an iterator of tuples.
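A quick illustration of zip() in action (the lists are arbitrary examples):

```python
names = ["a", "b", "c"]
scores = [90, 85, 88]

# Pair elements at the same index from both lists.
pairs = list(zip(names, scores))
print(pairs)  # [('a', 90), ('b', 85), ('c', 88)]

# zip() stops at the shortest iterable.
print(list(zip(names, [1, 2])))  # [('a', 1), ('b', 2)]
```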
o Pass by reference
o Pass by value
Strictly speaking, Python passes object references by value. For mutable objects (such as lists and dictionaries), this behaves like pass by reference: if you change the object inside a function, the change is reflected in the calling function as well, because both names refer to the same object.
For immutable objects (such as numbers and strings), it behaves like pass by value: rebinding the parameter inside the function (for example, assigning a = 20 when a = 10 was passed) does not affect the caller. The two variables then hold different values, and the original value persists after the function returns.
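The two behaviours can be demonstrated with a short sketch (function and variable names are illustrative):

```python
def mutate(lst):
    # mutating the shared list object: visible to the caller
    lst.append(99)

def rebind(n):
    # rebinding the local name only: caller is unaffected
    n = 20
    return n

nums = [1, 2]
mutate(nums)
print(nums)   # [1, 2, 99]

x = 10
rebind(x)
print(x)      # 10
```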
Python supports default arguments, which allow a function to be called with fewer arguments than it defines; for an arbitrary number of arguments, Python provides *args and **kwargs.
Python's constructor: __init__() is the first method of a class. Whenever we try to instantiate an object, __init__() is automatically invoked by Python to initialize the members of the object. We cannot overload constructors or methods in Python; if we define a method twice, the later definition simply replaces the earlier one.
Example:
class student:
    def __init__(self, name, email):
        self.name = name
        self.email = email

s1 = student("Rohan", "rohan@example.com")
print(s1.name)
Output:
Rohan
11) What is the difference between remove() function and del statement?
The user can use the remove() function to delete a specific object in the list. It removes the first matching value.
Example:
list_1 = [3, 5, 7, 3, 9, 3]
print(list_1)
list_1.remove(3)
print("After removing: ", list_1)
Output:
[3, 5, 7, 3, 9, 3]
After removing: [5, 7, 3, 9, 3]
If you want to delete an object at a specific location (index) in the list, you can either use del or pop().
Example:
list_1 = [3, 5, 7, 3, 9, 3]
print(list_1)
del list_1[2]
print("After deleting: ", list_1)
Output:
[3, 5, 7, 3, 9, 3]
After deleting: [3, 5, 3, 9, 3]
We cannot use these methods with a tuple because a tuple is immutable, unlike a list.
swapcase() is a string method which converts all uppercase characters into lowercase and vice versa. It is used to invert the existing case of the string. This method creates a copy of the string with every character's case swapped; the original string is unchanged. If the string is in lowercase, it generates an uppercase string and vice versa. It automatically ignores all non-alphabetic characters. See an example below.
Example:
string = "IT IS IN UPPERCASE."
print(string.swapcase())

string2 = "it is in lowercase."
print(string2.swapcase())
Output:
it is in uppercase.
IT IS IN LOWERCASE.
To remove leading and trailing whitespace from a string, Python provides the strip([chars]) built-in function. This function returns a copy of the string with the surrounding whitespace (or the given characters) removed if present; otherwise, it returns the original string.
Example:
string = "  javatpoint  "
string2 = "  javatpoint"
string3 = "javatpoint  "
print(string)
print(string2)
print(string3)
# after stripping
print(string.strip())
print(string2.strip())
To remove leading characters from a string, we can use the lstrip() function. It is a Python string function which takes an optional character-type parameter. If a parameter is provided, it removes those characters from the left end; otherwise, it removes all the leading spaces from the string.
Example:
string = "  javatpoint"
string2 = "@@@@javatpoint"
print(string)
print(string2)
print(string.lstrip())
print(string2.lstrip("@"))
Output:
After stripping, all the leading whitespaces (or the given characters) are removed, and the strings look like the below:
javatpoint
javatpoint
join() is a string method which returns a string value. It concatenates the elements of an iterable, using the string it is called on as the separator. It provides a flexible way to concatenate strings. See an example below.
Example:
str = "Rohan"
str2 = "ab"
# Calling function
str2 = str.join(str2)
# Displaying result
print(str2)
Output:
aRohanb
The shuffle() method shuffles the given list (or other mutable sequence) in place, randomizing the order of its items. This method is present in the random module, so we need to import it before we can call the function. It reorders the elements each time the function is called and so produces different output.
Example:
# import the random module
import random
# declare a list
sample_list1 = ['A', 'B', 'C', 'D', 'E']
print("Original LIST1: ")
print(sample_list1)
# first shuffle
random.shuffle(sample_list1)
print("\nAfter the first shuffle of LIST1: ")
print(sample_list1)
# second shuffle
random.shuffle(sample_list1)
print("\nAfter the second shuffle of LIST1: ")
print(sample_list1)
The break statement is used to terminate the execution of the current loop. Break always stops the current execution and transfers control outside the current block. If the block is in a loop, it exits from the loop, and if the break is in a nested loop, it exits only from the innermost loop.
Example:
list_1 = [1, 2]
list_2 = [3, 4]
for i in list_1:
    for j in list_2:
        print(i, j)
        if j == 4:
            print('BREAK')
            break
    else:
        continue
    break  # reached only when the inner loop was broken
A tuple is a built-in data collection type. It allows us to store values in a sequence. It is immutable, so its elements cannot be added, removed, or changed after creation. It uses () parentheses rather than [] square brackets to create a tuple. We cannot remove an element, but we can search for it in the tuple. We can use indexing to get elements, and negative indexing allows traversing elements in reverse order. Tuples support various functions like max(), sum(), sorted(), len(), etc.
Example:
# Declaring a tuple
tup = (2, 4, 6, 8)
# Displaying value
print(tup)
# Accessing a value by index
print(tup[2])
Output:
(2, 4, 6, 8)
6
Python provides libraries/modules that enable you to manipulate text files and binary files on the file system. They help to create files, update their contents, and copy and delete files. The libraries are os, os.path, and shutil.
Here, the os and os.path modules include functions for accessing the filesystem, while the shutil module enables you to copy and delete files.
20) What are the different file processing modes supported by Python?
Python provides several modes to open files, the main ones being read-only ('r'), write-only ('w'), read-write ('r+'), and append ('a'). 'r' is used to open a file in read-only mode, 'w' in write-only mode, 'r+' in read-and-write mode, and 'a' in append mode. If the mode is not specified, the file opens in read-only mode by default. (Note that the read-write mode string is 'r+', not 'rw'.)
o Read-only mode (r): Open a file for reading. It is the default mode.
o Write-only mode (w): Open a file for writing. If the file contains data, that data is truncated and lost; otherwise, a new file is created.
o Read-write mode (r+): Open a file for both reading and writing, i.e., updating mode.
o Append mode (a): Open for writing, appending to the end of the file if the file exists.
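A minimal sketch of the common modes, using a temporary file so nothing on disk is touched:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")

with open(path, "w") as f:   # 'w' creates (or truncates) the file
    f.write("hello\n")

with open(path, "a") as f:   # 'a' appends to the end
    f.write("world\n")

with open(path, "r") as f:   # 'r' reads from the start (default mode)
    content = f.read()

print(content)  # hello\nworld\n
```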
An operator is a particular symbol which is used on some values and produces an output as a result. An operator works on
operands. Operands are numeric literals or variables which hold some values.
Operators can be unary, binary, or ternary. An operator which requires a single operand is known as a unary operator, one which requires two operands is a binary operator, and one which requires three operands is a ternary operator.
Example:
# Unary Operator
A = 12
B = -A
print(B)
# Binary Operator
A = 12
B = 13
print(A + B)
print(B * A)
# Ternary Operator (conditional expression)
A = 12
B = 13
print(A if A < B else B)
Python uses a rich set of operators to perform a variety of operations. Some of them, like the membership and identity operators, are less familiar but allow powerful operations.
o Arithmetic Operators
o Relational (Comparison) Operators
o Assignment Operators
o Logical Operators
o Membership Operators
o Identity Operators
o Bitwise Operators
Arithmetic operators perform basic arithmetic operations. For example, "+" is used for addition and "-" is used for subtraction.
Example:
print(12 + 23)
print(23 - 12)
print(12 * 23)
Output:
35
11
276
Relational operators are used to compare values. These operators test a condition and return a boolean value, either True or False.
Example:
a, b = 10, 12
print(a == b) # False
print(a < b)  # True
print(a <= b) # True
print(a != b) # True
Output:
False
True
True
True
Assignment operators are used to assign values to variables. See the example below.
Example:
a = 12
print(a)  # 12
a += 2
print(a)  # 14
a -= 2
print(a)  # 12
a *= 2
print(a)  # 24
a **= 2
print(a)  # 576
Output:
12
14
12
24
576
Logical operators are used to perform logical operations such as and, or, and not. See the example below.
Example:
a = True
b = False
print(a and b) # False
print(a or b)  # True
print(not b)   # True
Output:
False
True
True
Membership operators are used to check whether an element is a member of a sequence (list, dictionary, tuple) or not. Python uses two membership operators, in and not in, to check for the presence of an element. See an example.
Example:
cities = ("india", "delhi")
print("india" in cities)       # True
print("mumbai" not in cities)  # True
Output:
True
True
Identity operators (is and is not) are used to check whether two values or variables refer to the same object in memory. Two variables that are equal do not necessarily refer to the identical object. See the following example.
Example:
a = 10
b = 12
print(a is b) # False
Bitwise operators are used to perform operations on individual bits. The bitwise operators (&, |, ^, ~) work on bits. See the example below.
Example:
a = 10
b = 12
print(a & b) # 8
print(a | b) # 14
print(a ^ b) # 6
print(~a)    # -11
Output:
8
14
6
-11
In Python 3, the old Unicode type has been replaced by the "str" type, and strings are treated as Unicode by default. We can encode a string to UTF-8 bytes by using the string's encode("utf-8") method.
Example:
unicode_1 = "javatpoint".encode("utf-8")
print(unicode_1)
Output:
b'javatpoint'
Python is an interpreted language. A Python program runs directly from the source code: the interpreter converts the source code into intermediate bytecode, which is then executed by the Python virtual machine.
o Memory management in python is managed by Python private heap space. All Python objects and data structures
are located in a private heap. The programmer does not have access to this private heap. The python interpreter takes care
of this instead.
o The allocation of heap space for Python objects is done by Python's memory manager. The core API gives access to
some tools for the programmer to code.
o Python also has an inbuilt garbage collector, which recycles all the unused memory so that it can be made available to the heap space.
Example:
def function_is_called():
    def function_is_returned():
        print("AdnanManna")
    return function_is_returned

new_1 = function_is_called()
# Outputs "AdnanManna"
new_1()
Output:
AdnanManna
A function is a block of code that performs a specific task whereas a decorator is a function that modifies other functions.
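A minimal decorator sketch (the function names are illustrative):

```python
def shout(func):
    # decorator: returns a wrapper that upper-cases func's result
    def wrapper():
        return func().upper()
    return wrapper

@shout
def greet():
    return "hello"

print(greet())  # HELLO
```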
27) What are the rules for a local and global variable in Python?
Global Variables:
o Variables declared outside a function or in global space are called global variables.
o If a variable is assigned a new value inside a function, that variable becomes implicitly local. To assign to a global variable from inside a function, we need to declare it explicitly with the global keyword.
o Global variables are accessible anywhere in the program, and any function can access and modify its value.
Example:
1. A = "AdnanManna"
2. def my_function():
3. print(A)
4. my_function()
Output:
Local Variables:
o Any variable declared inside a function is known as a local variable. This variable is present in the local space and not
in the global space.
o If a variable is assigned a new value anywhere within the function's body, it's assumed to be a local.
def my_function2():
    K = "local value"
    print(K)
my_function2()
Output:
local value
The namespace is a fundamental idea to structure and organize the code that is more useful in large projects. However, it
could be a bit difficult concept to grasp if you're new to programming. Hence, we tried to make namespaces just a little
easier to understand.
A namespace is defined as a simple system to control the names in a program. It ensures that names are unique and won't
lead to any conflict.
Also, Python implements namespaces in the form of dictionaries and maintains name-to-object mapping where names act
as keys and the objects as values.
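Because Python implements namespaces as dictionaries, they can be inspected directly (the variable names here are illustrative):

```python
x = 10  # lives in the module-level (global) namespace

def f():
    y = 20  # lives in the function's local namespace
    return ("x" in globals(), "y" in locals())

print(f())  # (True, True)
```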
In Python, iterators are used to iterate over a group of elements in containers such as a list. The underlying collection can be a list, tuple, or dictionary. A Python iterator implements the __iter__() and __next__() methods to step through the stored elements. In Python, we generally use loops to iterate over the collections (list, tuple).
In simple words: iterators are objects which can be traversed through or iterated upon.
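A quick sketch of the iterator protocol in action:

```python
nums = [10, 20, 30]

it = iter(nums)     # calls nums.__iter__()
print(next(it))     # 10  (calls it.__next__())
print(next(it))     # 20
print(next(it))     # 30
# one more next(it) would raise StopIteration
```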
In Python, a generator is a way of implementing iterators. It is a normal function except that it yields an expression in the function body. A generator does not need to implement the __iter__() and __next__() methods itself, which removes that overhead.
If a function contains at least one yield statement, it becomes a generator. The yield keyword pauses the current execution, saving its state, and then resumes from the same point when required.
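A minimal generator sketch (the countdown function is illustrative):

```python
def countdown(n):
    # each yield pauses the function and hands a value back
    while n > 0:
        yield n
        n -= 1

for value in countdown(3):
    print(value)   # prints 3, then 2, then 1
```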
Slicing is a mechanism used to select a range of items from a sequence type like a list, tuple, or string. It is an easy and convenient way to get elements from a range. It uses a : (colon) to separate the start and end index of the slice.
All sequence collection types, such as lists and tuples, allow us to use slicing to fetch elements. Although we can get a single element by specifying an index, slicing lets us fetch a group of elements at once.
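A short sketch of slicing on a list and a string (the sample values are illustrative):

```python
data = [0, 1, 2, 3, 4, 5]
print(data[1:4])   # [1, 2, 3] - from index 1 up to (not including) 4
print(data[:3])    # [0, 1, 2] - start defaults to 0
print(data[::2])   # [0, 2, 4] - every second element

s = "javatpoint"
print(s[0:4])      # java
print(s[-5:])      # point
```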
The Python dictionary is a built-in data type. It defines a one-to-one relationship between keys and values. Dictionaries
contain a pair of keys and their corresponding values. It stores elements in key and value pairs. The keys are unique whereas
values can be duplicate. The key accesses the dictionary elements.
Example:
The following example contains the keys Country, Hero & Cartoon. Their corresponding values are India, Modi, and Rahul respectively.
dict_1 = {"Country": "India", "Hero": "Modi", "Cartoon": "Rahul"}
print(dict_1["Country"])
Output:
India
The pass keyword specifies a Python statement that performs no operation. It acts as a placeholder in a compound statement. If we want to create an empty class or an empty function, the pass keyword lets control pass through without an error.
Example:
class Student:
    pass  # an empty class

class Student:
    def info():
        pass  # an empty method
The Python docstring is a string literal that occurs as the first statement in a module, function, class, or method definition. It
provides a convenient way to associate the documentation.
String literals occurring immediately after a simple assignment at the top are called "attribute docstrings".
String literals occurring immediately after another docstring are called "additional docstrings". Python uses triple quotes to
create docstrings even though the string fits on one line.
Docstring phrase ends with a period (.) and can be multiple lines. It may consist of spaces and other special chars.
Example:
# One-line docstring
def hello():
    """This function returns a greeting."""
    return "hello"

print(hello.__doc__)
35) What is a negative index in Python and why are they used?
Sequences in Python are indexed with both positive and negative numbers. Positive indexing starts at '0' for the first element, '1' for the second, and so on.
Negative indexing starts at '-1', which represents the last element in the sequence, with '-2' as the penultimate index, and so on backwards through the sequence.
A common use of the negative index is to drop the last character of a string, for example a trailing new-line, by writing S[:-1], which keeps everything except the last character. Negative indexing is also used to read a sequence conveniently from the end.
36) What is pickling and unpickling in Python?
The Python pickle is defined as a module which accepts any Python object and converts it into a string representation. It
dumps the Python object into a file using the dump function; this process is called Pickling.
The process of retrieving the original Python objects from the stored string representation is called Unpickling.
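A short pickling/unpickling sketch using dumps()/loads(), the in-memory counterparts of the dump()/load() file functions:

```python
import pickle

data = {"name": "Rohan", "scores": [1, 2, 3]}

# Pickling: convert the object into a bytes representation
blob = pickle.dumps(data)

# Unpickling: restore the original object from that representation
restored = pickle.loads(blob)
print(restored == data)  # True
```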
Java and Python both are object-oriented programming languages. Let's compare both on some criteria given below:
help() and dir() are both accessible from the Python interpreter and are used for viewing a consolidated dump of built-in functions.
help() function: The help() function is used to display the documentation string and also lets us see the help related to modules, keywords, and attributes.
dir() function: The dir() function is used to display the defined symbols.
39) What are the differences between Python 2.x and Python 3.x?
Python 2.x is an older version of Python. Python 3.x is newer and latest version. Python 2.x is legacy now. Python 3.x is the
present and future of this language.
The most visible difference between Python2 and Python3 is in print statement (function). In Python 2, it looks like print
"Hello", and in Python 3, it is print ("Hello").
The xrange() method has been removed in Python 3. A new keyword, as, was introduced for error handling.
In Python, some work is done at compile time, but most checking, such as of types and names, is postponed until code execution. Consequently, if the Python code references a user-defined function that does not exist, the code will compile successfully; it will fail with an exception only when the execution path actually reaches the missing reference.
41) What is the shortest method to open a text file and display its content?
The shortest way to open a text file is by using "with" command in the following manner:
Example:
with open("file_name.txt", "r") as f:
    fileData = f.read()
print(fileData)
The enumerate() function is used to iterate through the sequence and retrieve the index position and its corresponding
value at the same time.
Example:
list_1 = ["A", "B", "C"]
s_1 = "Javatpoint"
# creating enumerate objects
object_1 = enumerate(list_1)
object_2 = enumerate(s_1)
print(list(enumerate(list_1)))
print(list(enumerate(s_1)))
Output:
[(0, 'A'), (1, 'B'), (2, 'C')]
[(0, 'J'), (1, 'a'), (2, 'v'), (3, 'a'), (4, 't'), (5, 'p'), (6, 'o'), (7, 'i'), (8, 'n'), (9, 't')]
Since indexing starts from zero, an element present at 3rd index is 7. So, the output is 7.
Type conversion refers to the conversion of one data type into another.
int() - converts a compatible value into an integer
float() - converts a compatible value into a float
ord() - converts a character into its integer code point
list() - converts any iterable into a list
dict() - converts a sequence of (key, value) pairs into a dictionary
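The conversion functions above in action:

```python
print(int("12"))       # 12
print(float(7))        # 7.0
print(ord("A"))        # 65
print(list("abc"))     # ['a', 'b', 'c']
print(dict([("a", 1), ("b", 2)]))  # {'a': 1, 'b': 2}
```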
To send an email, Python provides smtplib and email modules. Import these modules into the created mail script and send
mail by authenticating a user.
It has a method SMTP(smtp-server, port). It requires two parameters to establish SMTP connection.
Example:
import smtplib
# Calling SMTP (replace the server address with your SMTP server)
s = smtplib.SMTP("smtp-server-address", 587)
# start TLS for security
s.starttls()
# Authentication
s.login("sender@email_id", "sender_email_id_password")
# Message to be sent
message = "Message_sender_need_to_send"
# sending the mail and terminating the session
s.sendmail("sender@email_id", "receiver@email_id", message)
s.quit()
Arrays and lists in Python store data in a similar way, but arrays can hold elements of only a single data type, whereas lists can hold elements of any data type.
Example:
import array
User_Array = array.array('i', [1, 2, 3])
User_list = [1, "two", 3.0]
print(User_Array)
print(User_list)
Output
array('i', [1, 2, 3])
[1, 'two', 3.0]
An anonymous function in Python is a function that is defined without a name. Normal functions are defined using the def keyword, whereas anonymous functions are defined using the lambda keyword; they are therefore also called lambda functions.
A lambda form cannot contain statements, only a single expression, because it exists to create a new function object and return it at runtime.
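A short lambda sketch (the sample values are illustrative):

```python
# a named lambda: a single expression, no statements
square = lambda x: x * x
print(square(5))   # 25

# lambdas are often used inline, e.g. as a sort key
words = ["banana", "fig", "apple"]
print(sorted(words, key=lambda w: len(w)))  # ['fig', 'apple', 'banana']
```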
A function is a block of code which is executed only when it is called. To define a Python function, the def keyword is used.
Example:
def New_func():
    print("Hi, Welcome to the function")
New_func()
Output:
Hi, Welcome to the function
__init__ is a method, the constructor, in Python. This method is automatically called to initialize a newly created object/instance of a class. All classes have the __init__ method.
Example:
class Employee_1:
    def __init__(self, name, age):
        self.name = name
        self.age = age
        self.salary = 20000

E_1 = Employee_1("Rohan", 22)
print(E_1.name)
print(E_1.age)
Output:
Rohan
22
self refers to the instance, or object, of a class. In Python, it is explicitly included as the first parameter of instance methods; this is not the case in Java, where the equivalent this is implicit. self helps to differentiate between the methods and attributes of a class and local variables. The self variable in the __init__ method refers to the newly created object, while in other methods it refers to the object whose method was called.
The random module is the standard module used to generate random numbers. The basic method is:
import random
random.random()
The random.random() method returns a floating-point number in the range [0, 1). The methods used with the random class are bound methods of hidden instances of random.Random; separate instances can be created for multi-threaded programs so that each thread gets its own generator. Other random generators in this module include:
randrange(a, b): chooses an integer from the range [a, b). It returns an element selected randomly from the specified range without building a range object.
uniform(a, b): chooses a floating-point number between a and b and returns it.
normalvariate(mean, sdev): used for the normal distribution, where mean is mu and sdev is the sigma used for the standard deviation.
The Random class can be instantiated directly to create multiple independent random number generators.
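A short sketch of the generators above; the checks assert only the documented ranges:

```python
import random

random.seed(42)                 # seeding makes the run reproducible

x = random.random()             # float in [0, 1)
n = random.randrange(1, 10)     # integer in [1, 10)
u = random.uniform(1.0, 2.0)    # float between 1.0 and 2.0

print(0.0 <= x < 1.0)           # True
print(1 <= n < 10)              # True
print(1.0 <= u <= 2.0)          # True
```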
PYTHONPATH is an environment variable which is used when a module is imported. Whenever a module is imported,
PYTHONPATH is also looked up to check for the presence of the imported modules in various directories. The interpreter
uses it to determine which module to load.
54) What are python modules? Name some commonly used built-in modules in Python?
Python modules are files containing Python code. This code can be functions, classes, or variables. A Python module is a .py file containing executable code.
Commonly used built-in modules:
o os
o sys
o math
o random
o datetime
o json
For the most part, xrange and range are exactly the same in terms of functionality. They both provide a way to generate a list of integers for you to use, however you please. The only difference is that range returns a Python list object and xrange returns an xrange object. (This distinction applies to Python 2; in Python 3, xrange has been removed and range itself is lazy.)
This means that xrange doesn't actually generate a static list at run-time like range does. It creates the values as you need
them with a special technique called yielding. This technique is used with a type of object known as generators. That means
that if you have a really gigantic range you'd like to generate a list for, say one billion, xrange is the function to use.
This is especially true if you have a really memory sensitive system such as a cell phone that you are working with, as range
will use as much memory as it can to create your array of integers, which can result in a Memory Error and crash your
program. It's a memory hungry beast.
56) What advantages do NumPy arrays offer over (nested) Python lists?
o Python's lists are efficient general-purpose containers. They support (fairly) efficient insertion, deletion, appending,
and concatenation, and Python's list comprehensions make them easy to construct and manipulate.
o They have certain limitations: they don't support "vectorized" operations like elementwise addition and multiplication, and the fact that they can contain objects of differing types means that Python must store type information for every element and must execute type-dispatching code when operating on each element.
o NumPy is not just more efficient; it is also more convenient. We get a lot of vector and matrix operations for free,
which sometimes allow one to avoid unnecessary work. And they are also efficiently implemented.
o NumPy array is faster and we get a lot built in with NumPy, FFTs, convolutions, fast searching, basic statistics, linear
algebra, histograms, etc.
57) Mention what the Django templates consist of.
The template is a simple text file. It can create any text-based format like XML, CSV, HTML, etc. A template contains variables ({{ variable }}) that get replaced with values when the template is evaluated, and tags ({% tag %}) that control the logic of the template.
Django provides a session that lets the user store and retrieve data on a per-site-visitor basis. Django abstracts the process
of sending and receiving cookies, by placing a session ID cookie on the client side, and storing all the related data on the
server side.
So, the data itself is not stored client side. This is good from a security perspective.
To subset or filter data in SQL, we use WHERE and HAVING clauses. Consider the following movie table.
Using this table, let's find the records for movies that were directed by Brad Bird (column names assumed from context):
SELECT * FROM movie WHERE director = 'Brad Bird';
Now, let's filter the table for directors whose movies have an average duration greater than 115 minutes:
SELECT director, AVG(duration) AS avg_duration FROM movie GROUP BY director HAVING AVG(duration) > 115;
2. What is the difference between a WHERE clause and a HAVING clause in SQL?
Answer all of the given differences when this data analyst interview question is asked, and also give out the syntax for each
to prove your thorough knowledge to the interviewer.
WHERE: the filter occurs before any groupings are made; it filters individual rows.
Syntax:
SELECT column_name(s)
FROM table_name
WHERE condition;
HAVING: used to filter values from a group.
Syntax:
SELECT column_name(s)
FROM table_name
GROUP BY column_name(s)
HAVING condition
ORDER BY column_name(s);
1. Is the below SQL query correct? If not, how will you rectify it?
The query stated above is incorrect as we cannot use the alias name while filtering data using the WHERE clause. It will
throw an error.
2. How are Union, Intersect, and Except used in SQL?
The Union operator combines the output of two or more SELECT statements. Syntax:
SELECT column_name(s) FROM table_1
UNION
SELECT column_name(s) FROM table_2;
Let’s consider the following example, where there are two tables - Region 1 and Region 2.
The Intersect operator returns the common records that are the results of 2 or more SELECT statements. Syntax:
SELECT column_name(s) FROM table_1
INTERSECT
SELECT column_name(s) FROM table_2;
A Subquery in SQL is a query within another query. It is also known as a nested query or an inner query. Subqueries are used
to enhance the data to be queried by the main query.
Below is an example of a subquery that returns the name, email id, and phone number of an employee from Texas city.
SELECT name, email_id, phone_number
FROM employee
WHERE emp_id IN (
    SELECT emp_id FROM employee WHERE city = 'Texas');
4. Using the product_price table, write an SQL query to find the record with the fourth-highest market price.
First, select the top four records in descending order of mkt_price; then select the top one from that result in ascending order of mkt_price (SQL Server TOP syntax assumed):
SELECT TOP 1 * FROM (
    SELECT TOP 4 * FROM product_price ORDER BY mkt_price DESC
) AS t
ORDER BY mkt_price ASC;
5. From the product_price table, write an SQL query to find the total and average market price for each currency
where the average market price is greater than 100, and the currency is in INR or AUD.
The SQL query is as follows (column names assumed from the question):
SELECT currency, SUM(mkt_price) AS total_price, AVG(mkt_price) AS avg_price
FROM product_price
WHERE currency IN ('INR', 'AUD')
GROUP BY currency
HAVING AVG(mkt_price) > 100;
6. Using the product and sales order detail table, find the products with total units sold greater than 1.5 million.
We can use an inner join to get records from both the tables. We’ll join the tables based on a common key column, i.e.,
ProductID.
You must be prepared for this question thoroughly before your next data analyst interview. The stored procedure is an SQL
script that is used to run a task several times.
Let’s look at an example to create a stored procedure to find the sum of the first N natural numbers' squares.
8. Write an SQL stored procedure to find the total count of even numbers between two user-given numbers.
Here is the output to print all even numbers between 30 and 45.
Tableau Data Analyst Interview Questions
LOD in Tableau stands for Level of Detail. It is an expression that is used to execute complex queries involving many
dimensions at the data sourcing level. Using LOD expression, you can find duplicate values, synchronize chart axes and
create bins on aggregated data.
Extract: Extract is an image of the data that will be extracted from the data source and placed into the Tableau repository.
This image(snapshot) can be refreshed periodically, fully, or incrementally.
Live: The live connection makes a direct connection to the data source. The data will be fetched straight from tables. So,
data is always up to date and consistent.
Joins in Tableau work similarly to the SQL join statement. Below are the types of joins that Tableau supports:
• Inner Join
• Left Join
• Right Join
• Full Outer Join
13. What is a Gantt Chart in Tableau?
A Gantt chart in Tableau depicts the progress of value over the period, i.e., it shows the duration of events. It consists of
bars along with the time axis. The Gantt chart is mostly used as a project management tool where each bar is a measure of a
task in the project.
14. Using the Sample Superstore dataset, create a view in Tableau to analyze the sales, profit, and quantity sold across
different subcategories of items present under each category.
• Drag Category and Subcategory columns into Rows, and Sales on to Columns. It will result in a horizontal bar chart.
• Drag Profit on to Colour, and Quantity on to Label. Sort the Sales axis in descending order of the sum of sales within
each sub-category.
16. Create a dual-axis chart in Tableau to present Sales and Profit across different years using the Sample Superstore dataset.
• Drag the Order Date field from Dimensions on to Columns, and convert it into continuous Month.
• Drag Sales on to Rows, and Profits to the right corner of the view until you see a light green rectangle.
• Synchronize the right axis by right-clicking on the profit axis.
• Under the Marks card, change SUM(Sales) to Bar and SUM(Profit) to Line and adjust the size.
17. Design a view in Tableau to show State-wise Sales and Profit using the Sample Superstore dataset.
• Drag the Country field on to the view section and expand it to see the States.
• Increase the size of the bubbles, add a border, and halo color.