Ques.
Intro
Ans. I am Deepanshu Bansal from Noida, Uttar Pradesh. I completed my graduation, a Bachelor of Computer Applications, at
Galgotias University, and for the past 3.2 years I have been working at Mastercard as a Production Support Engineer, applying my expertise
in Linux, SQL and shell scripting for the benefit of the organisation.
Currently I am working on the Payment Acquiring Switch project at Mastercard Inc., where we act as a gateway between the merchant and
the customer's bank and process a high volume of transactions between them.
We also handle transaction-related issues such as payment failures or incomplete transactions, and manage all aspects of payment, such as
verification, authorization, routing, processing, and settlement of transactions between both parties.
My roles and responsibilities in the project are (if good, these should also be in the resume):
Working on assigned tickets: service tasks (reversing a pending transaction, adding a new card number on request), incidents, and problem
management. Typical issues are payment failures or incomplete transactions (insufficient funds, network error, error at the bank's end),
a customer unable to make payments (expired or blocked card), or finding a permanent solution for an incident.
Deploying new software releases as per change requests.
Monitoring and troubleshooting incidents and system alerts, such as the application being down or stuck, mount points being full, or any other
performance issue.
Transferring data files into the database and onto the Linux server using SQL Loader and WinSCP.
Analyzing and monitoring scheduled or running jobs and system performance.
Maintaining technical documents such as SOPs.
Reviewing data files received on the Linux server from the switch end.
Checking for missing files and, if any are missing, fetching them manually over FTP.
Transferring data files to the Linux server using WinSCP.
Conducting root-cause analysis for issues and proposing a corrective action plan as and when needed.
Ensuring on-time delivery of all assigned tickets (incidents, problem tickets, etc.) within the SLA time.
Apart from that, I generate reports such as audit reports.
Ques. How do you get work in your company?
Ans. Generally, I receive alerts by email and tickets in the BMC Remedy tool.
I first check the alerts and tickets that have been assigned to me, and then I start working on them.
Ques. What does a ticket contain?
Ans. 1. A ticket is a document in which the work to be done is described.
2. Every ticket has a unique number.
3. It keeps a record of every operation.
4. It is also used for tracking purposes.
Ques. Types of tickets?
Ans. 1. Task/service
2. Incident
3. Change request
4. Problem management
1. Task/Service :- A normal type of ticket in which we serve a demand or request from the client,
such as the application being down or stuck, a performance issue, report generation or any other request, ensuring client satisfaction.
2. Incident :- An incident is raised when the application does not work properly for the purpose it is used for.
Here we face errors such as data-gathering and script errors that make the application slow or stuck,
full mount points, or deadlocks.
Types of Incidents -: Incidents are classified by priority, which sets the SLA time within which they must be resolved.
P1 = highest priority; SLA time is 4 hours.
P2 = lower priority than P1; SLA time is 8 hours.
P3 = low priority; SLA time is 24 hours.
P4 = lowest priority; SLA time is up to 30 hours.
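The priority table above can be sketched in Python as a lookup that turns a ticket's opening time into its SLA deadline. This is only an illustration: the function names and the sample timestamp are assumptions, and the real deadlines are tracked by the ticketing tool.

```python
from datetime import datetime, timedelta

# SLA hours per priority, taken from the list above.
SLA_HOURS = {"P1": 4, "P2": 8, "P3": 24, "P4": 30}


def sla_deadline(priority: str, opened_at: datetime) -> datetime:
    """Return the time by which a ticket of this priority must be resolved."""
    return opened_at + timedelta(hours=SLA_HOURS[priority])


def is_breached(priority: str, opened_at: datetime, now: datetime) -> bool:
    """True if the SLA window has already elapsed."""
    return now > sla_deadline(priority, opened_at)


opened = datetime(2024, 1, 1, 22, 0)  # hypothetical ticket opened at 10pm
print(sla_deadline("P1", opened))     # 2024-01-02 02:00:00
```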
Resolving any incident -:
First we assign the ticket to ourselves and change its status to Work in Progress. Then we analyse the problem
and check its priority and severity, i.e. the total impact of the issue on our application, and set up communication between
the teams. Then:
1. First we try to replicate the issue, to check whether the incident actually occurs or whether it is a mistake on the user or
client side.
2. If the issue is confirmed, the next step is troubleshooting:
- if it is a performance/slowness issue, we check CPU and memory utilization with the top command and mount point usage with
df -kh (we can also check the database for any query taking a long time to execute)
- if it is a processing or any other issue, we check whether the related processes are working properly.
3. Next we check the related logs with tail -f, and grep for errors by date and time.
After analysing the error:
- Known error: we follow the steps mentioned in the SOP to resolve it.
- Unknown error: we inform our manager, who sets up a bridge call in which all the related teams
participate and discuss the issue. Once a proper solution is found and applied, the incident is moved to the closed state.
We then ask for the RCA (Root Cause Analysis), which states why the issue occurred and how it was solved.
It is provided by the team responsible for the issue, so that it can be added to the SOP for future reference.
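Step 3 above (grepping the logs for errors by date and time) can be sketched in Python as follows. The timestamp layout, the "ERROR" keyword and the sample log lines are assumptions for illustration; the real application logs may look different.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed log timestamp layout


def errors_between(lines, start, end, keyword="ERROR"):
    """Return lines containing `keyword` whose timestamp lies in [start, end]."""
    matches = []
    for line in lines:
        if keyword not in line:
            continue
        try:
            stamp = datetime.strptime(line[:19], TIMESTAMP_FORMAT)
        except ValueError:
            continue  # line does not start with a timestamp; skip it
        if start <= stamp <= end:
            matches.append(line.rstrip("\n"))
    return matches


# Hypothetical log lines, not taken from a real system.
sample_log = [
    "2024-01-01 10:00:01 INFO  batch started",
    "2024-01-01 10:05:42 ERROR connection reset by peer",
    "2024-01-01 11:30:00 ERROR disk quota exceeded",
]
window_start = datetime(2024, 1, 1, 10, 0, 0)
window_end = datetime(2024, 1, 1, 10, 59, 59)
for hit in errors_between(sample_log, window_start, window_end):
    print(hit)
```

Narrowing by time window first helps when the log contains many unrelated errors, since only the ones near the incident time remain.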
_________________________________________________________________________________________________________
Actions with reference to the SOP (some with the manager's permission (MP), some by the Linux admin team or DBA team) -:
- Try again, in case wrong details were entered.
- 1. Slowness: zip big files, delete older files (or rebuild the index in the database), with MP.
2. Other issues: resolve a deadlock (which stops the process from running) by commit or rollback; if a process is slowing down,
delete or zip files, or kill and restart the process, with MP.
- Update database entries (e.g. blocked or expired); deadlocks are resolved with commit or rollback by the DBA; process-related
issues by kill and restart with MP; script issues by the developer or L3 team.
3. Change Request - In this type of ticket, the client requests minor or major changes to the existing
application. Any upgrade to the application is also a change request.
After receiving a change request we inform our manager, and after discussion with the manager
a CAB meeting is scheduled, in which it is decided why the change is required and whether it is necessary.
After approval from the CAB meeting, the deployment process starts, following the
implementation plan given by the development team.
First we release a notification to all teams about the downtime, implementation plan, benefits, risks, etc.
A notification about the downtime is also released to all affected users.
Then the implementation plan is followed: before bringing all the services down we take screenshots (ss) of the processes and CPU/memory
utilisation, and take a backup of the old patch. Then the new patch is applied, and after the deployment completes all the services are brought up
again. We then do a sanity check to verify that the application is working properly. If it is not, the rollback steps of the implementation
plan are followed; if everything is fine, the completion notification is released and the
change request ticket is closed.
- The sanity check can be done by checking database entries with select count(*), or by comparing against
the screenshots taken previously.
- Altering any table is also a change request.
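The select count(*) sanity check described above can be sketched as a before/after comparison of row counts. This Python example uses an in-memory SQLite database purely as a stand-in; the table name, the helper names and the rule "a count must not drop" are assumptions, not the project's actual check.

```python
import sqlite3


def row_counts(conn, tables):
    """Run SELECT COUNT(*) for each table and return {table: count}."""
    counts = {}
    for table in tables:
        # Table names come from a fixed list in this sketch, not from user input.
        counts[table] = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    return counts


def sanity_check(before, after):
    """Return the tables whose row count dropped after the deployment."""
    return [t for t in before if after.get(t, 0) < before[t]]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany("INSERT INTO transactions (status) VALUES (?)", [("OK",), ("PENDING",)])

before = row_counts(conn, ["transactions"])  # snapshot before the services go down
conn.execute("INSERT INTO transactions (status) VALUES ('OK')")  # stands in for the deployment
after = row_counts(conn, ["transactions"])   # snapshot after the services are back up

print(before, after, sanity_check(before, after))
```

An empty list from sanity_check means no table lost rows; any table name in the list would be a reason to consider the rollback plan.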
4. Problem Management :- If an incident keeps recurring, it is handled under a problem management ticket.
First we assign the ticket to ourselves and change its status to Work in Progress. Then we analyse the problem
and check its priority and severity, i.e. the total impact of the issue on our application, and set up communication between
the teams. We then work out the RCA (Root Cause Analysis) to find why the incident keeps recurring, and check whether
we can handle the situation ourselves. If we can, we fix it, and after verifying the fix a notification is
released stating that the issue has been resolved and the application is working properly.
If we cannot handle the problem ourselves, we share the RCA document with the developer team, who work on
finding a permanent solution. During this process we follow up with them on progress. Once the
permanent solution is ready it goes into testing, and then a change request is created for the deployment, after which the normal change request
process is followed.
Ques- Biggest challenge?
Ans- I was working in my shift when I noticed on my dashboard that the pending count of CDR files being processed was increasing while the processed
count was unchanged. I started the troubleshooting steps, checking server connectivity and CPU/memory utilization, but all of these were
fine. I then checked the logs, but there were multiple errors, so I was not able to grep the actual one, and time
was passing. I informed my manager about the issue, and the manager brought the incident management team onto a bridge call
with all the other teams. While we were discussing the issue I noticed that some logs were not being generated, so I checked the mount
point space with df -kh and found that one mount point (MP) was almost full, which was why the log files were not being generated. I asked my
manager for permission to delete unnecessary files and zip some necessary ones. After getting the permission I started zipping
the files, but could not, because the MP was 100% full and zipping needs some free space. So I moved some files and directories
to another MP, zipped them there, and moved them back to the original location.
After this the MP space was released, logs started being generated again, the processed count increased and the pending count
stabilised. A sanity check was also done to confirm that everything was functioning properly. Some time later it emerged that someone had disabled the
mount point alert script during a deployment and forgot to enable it afterwards, which is why we received no alert when
the MP was 75% full earlier.
RCA - The mount point was full, so logs were not being generated, which led to files not being processed.
- ensure the alert script is re-enabled after deployment.
- ensure the script that deletes older files is running properly.
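The mount point alert mentioned in this incident can be sketched in Python as a simple threshold check. The 75% threshold comes from the story above; the function names and the "/" example path are assumptions, and the real alert script is not shown in these notes.

```python
import shutil

ALERT_THRESHOLD_PERCENT = 75  # threshold mentioned in the incident above


def percent_used(total_bytes, used_bytes):
    """Used space as a percentage of the total."""
    return 100.0 * used_bytes / total_bytes


def needs_alert(total_bytes, used_bytes, threshold=ALERT_THRESHOLD_PERCENT):
    """True once usage reaches the alert threshold."""
    return percent_used(total_bytes, used_bytes) >= threshold


def check_mount_point(path):
    """Read real usage for `path` and report whether it should alert."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return needs_alert(usage.total, usage.used)


print(check_mount_point("/"))  # "/" is only an example path
```

A check like this only helps while it is scheduled, which is why re-enabling it after deployment is part of the RCA.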
Ques- Why were the alerts not coming even though the mount point was 75% full?
Ans- Because someone had disabled the mount point full alert script during the deployment process and
forgot to enable it after completion.
Ques- Why are alerts disabled during deployment?
Ans- During the deployment process the application is down, which would otherwise trigger a large number of alerts.
_________________________________________________________________________________________________________________
RCA - Root Cause Analysis, which states why the issue occurred and how it was solved.
SLA - Service Level Agreement between the client and the service provider (us), which states the conditions and time for resolving any incident.
OLA – Operational Level Agreement between two or more parts/teams of the same organisation.
SOP – Standard Operating Procedure, which lists the steps for resolving any incident.
ITIL – A framework designed for the better functioning of processes such as selection, planning, delivery and maintenance of the overall
lifecycle of IT services.
Severity is a parameter that denotes the total impact of an issue on the application.
Priority is a parameter that defines the importance of an issue, and so the order in which issues should be resolved.
SDLC – The Software Development Life Cycle (SDLC) is a structured process that enables the production of high-quality,
low-cost software in the shortest possible production time. It manages all aspects of software, from
development to the production environment.
_________________________________________________________________________________________________________________
Ques- What is an unnecessary file?
Ans- Files older than 60/90 days are unnecessary files. If they are not deleted in time, a mount point 100% full alert will be raised.
Ques- Why did the alert not come even when the mount point was 100% full?
Ans- A deployment was in progress, so the alert was disabled, i.e. the alert scripts are commented out in crontab during deployment.
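Finding files past the retention age can be sketched in Python as below. The 90-day cutoff follows the answer above; the function name and the demo file names are assumptions, and the demo runs in a temporary directory so that nothing real is touched.

```python
import os
import tempfile
import time
from pathlib import Path


def files_older_than(directory, days, now=None):
    """Return regular files under `directory` last modified more than `days` days ago."""
    now = time.time() if now is None else now
    cutoff = now - days * 86400
    return sorted(p for p in Path(directory).rglob("*")
                  if p.is_file() and p.stat().st_mtime < cutoff)


# Demo in a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    old_file = Path(tmp) / "app.log.old"
    new_file = Path(tmp) / "app.log"
    old_file.write_text("old")
    new_file.write_text("new")
    hundred_days_ago = time.time() - 100 * 86400
    os.utime(old_file, (hundred_days_ago, hundred_days_ago))  # backdate the mtime

    stale = files_older_than(tmp, days=90)
    stale_names = [p.name for p in stale]
    print(stale_names)  # candidates for zipping or deletion
```

The function only lists candidates; the actual deletion or zipping would still need the manager's permission as described elsewhere in these notes.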
Ques- Shift?
Ans- Rotational Shift
* Morning - 6am to 3pm
* Evening - 2pm to 11pm
* Night - 10pm to 7am
- 1 hour - handover
- Support - 24*7
- Holidays - on a rotational basis.
Ques- Team members?
Ans- 8 members: 1 team lead/manager,
4 L1 and 3 L2
L1 - monitoring (freshers)
L2 - work on incidents and check for bugs in code.
L3 - act like admins: make any code changes and sometimes work on big incidents.
Ques- What do we do for a late-night P1 incident?
Ans- Acknowledge to the client.
Inform the manager and the IMT.
If the manager does not pick up the call, follow the escalation matrix.
Meanwhile we troubleshoot: process check, CPU/memory utilization, disk utilization, log check.
For a known error follow the SOP; for an unknown error set up a bridge call.
Ques- What if a server goes down?
Ans- Connect to the server with PuTTY.
Check processes with ps -ef | grep "<process name>"
CPU or memory utilization with top
Disk utilization with df -kh.
Check logs with grep "error" log_file_name. Otherwise inform the IMT and manager.
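The process check above (ps -ef piped to grep) can be sketched in Python as a filter over the ps -ef text. The sample output and the process name switch-gateway.jar are hypothetical; on a real server the text could come from subprocess.run(["ps", "-ef"], capture_output=True, text=True).stdout.

```python
def find_processes(ps_output, name):
    """Return the `ps -ef` lines whose command column mentions `name`."""
    matches = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        # ps -ef columns: UID PID PPID C STIME TTY TIME CMD
        fields = line.split(None, 7)
        if len(fields) == 8 and name in fields[7]:
            matches.append(line)
    return matches


# Hypothetical ps -ef output used only for illustration.
sample_ps = """UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 09:00 ?        00:00:02 /sbin/init
app       2210     1  3 09:05 ?        00:01:10 java -jar switch-gateway.jar
"""

running = find_processes(sample_ps, "switch-gateway")
print("running" if running else "NOT running")
```

Matching only on the command column avoids false hits on user names or terminal names that happen to contain the search text.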
Ques- What if a process slows down?
Ans- Check processes with ps -ef.
Then kill the process with the manager's permission.
Ques- What if the application slows down?
Ans- We check CPU/memory utilization (if full, zip or delete files with the manager's permission (MP)) or check disk utilization.
Ques- Application we use?
Ans- Merchant Identifier
_________________________________________________________________________________________________________________
- Bin (Bucket) contains all open & work in progress tickets.
- Ticketing Tool – ServiceNow
- Status of Tickets -:
Open when it arrives
Work in progress after it is assigned
On hold/pending when information is insufficient
Resolved/closed
- SOP - Standard Operating Procedure, contains the steps for resolving any incident
- Mapping (Ticket) - Form type UI - wrong mapping can lead to an incident
- Production environment - Used directly by users, so be careful. E.g. if the recharge table is updated, the recharge is recorded as successful.
- Preproduction environment - Used for testing
- IMT (incident management team) - It releases notifications to all teams during a change request.
- SLA - Service Level agreement
Contains Time period to resolve any ticket
Agreement b/w service provider and client.
- Monitoring tools - Grafana, Kibana or batch manager (graphs of mountpoint storages).
- Escalation matrix - A list of employees and their phone numbers who are available
at the time we get an incident, such as
-Manager
-T/L
-Employees
-Other teams - DBA team, Linux team, Networking team.
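Following the escalation matrix can be sketched in Python as walking an ordered list until someone answers. The order follows the list above; the entries are role names rather than real contacts, and the is_reachable callback is a stand-in for actually placing the call.

```python
def escalate(matrix, is_reachable):
    """Return the first contact in the matrix who can be reached, else None."""
    for contact in matrix:
        if is_reachable(contact):
            return contact
    return None


# Order follows the list above; these are role names, not real contacts.
ESCALATION_MATRIX = ["Manager", "Team Lead", "Employees",
                     "DBA team", "Linux team", "Networking team"]

unreachable = {"Manager"}  # e.g. the manager does not pick up the call
print(escalate(ESCALATION_MATRIX, lambda contact: contact not in unreachable))  # Team Lead
```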
- Report Generation - Open the database, fetch the desired data with a select query, transfer it into an Excel file, then generate the report for
sending.
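The report generation flow above (select query, then a spreadsheet file) can be sketched in Python with a CSV file, which Excel can open. SQLite stands in for the production database, and the audit_log table and its columns are hypothetical.

```python
import csv
import io
import sqlite3


def export_query_to_csv(conn, query, out_file):
    """Run `query` and write the header row plus all result rows as CSV."""
    cursor = conn.execute(query)
    writer = csv.writer(out_file)
    writer.writerow([col[0] for col in cursor.description])
    rows = cursor.fetchall()
    writer.writerows(rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE audit_log (txn_id INTEGER, status TEXT)")
conn.executemany("INSERT INTO audit_log VALUES (?, ?)", [(1, "SUCCESS"), (2, "FAILED")])

buffer = io.StringIO()  # a real report would open a file with newline="" instead
written = export_query_to_csv(
    conn, "SELECT txn_id, status FROM audit_log ORDER BY txn_id", buffer)
print(buffer.getvalue())
```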
- SQL Loader - It is used to load data from an external file into a table in the database.
- Sanity check - Process of checking whether the application is working properly or not after the deployment.
- Types of SLA
- Service based
- Customer based
- multi level
- CAB - Change advisory board
Happens once a week, on Tuesday
- System health - CPU, Memory, or disk utilization.
- Roster - At the beginning of the month we make a roster, i.e. the shift timetable for the month.
Ques- Second biggest challenge, biggest achievement, or an incident/issue faced?
Ans- I was working in my shift and received an alert that the application was stuck, so I started the troubleshooting steps.
I checked server connectivity and CPU utilization and checked the logs, but could not find the reason, so I informed
my manager, who brought the incident management team onto a bridge call with all the other teams.
During this I found a deadlock while running a query in the database, so I informed the DBA team. The DBA
team released the deadlock, then we restarted the application and the incident was closed.
RCA - A deadlock caused the application to get stuck.
- ensure proper commit and rollback of DML operations.
Ques- Project-based biggest challenge?
Ans- I was working in my shift when I noticed on my dashboard that the pending count of CDR files being processed was increasing while
the processed count was unchanged. I informed my manager about the issue and a bridge call was initiated. During this
I checked the database and found a query taking a long time to return results, which was slowing the processing.
I asked my manager for permission to rebuild its index. After the rebuild the query ran properly, the slowness in the
application was resolved, the processed count rose and the pending count stabilised. So, this was my biggest challenge.
RCA - The index had degraded and needed a rebuild, which made the query and therefore the application slow.
- ensure indexes are rebuilt periodically.