My research focuses on testing, developing, and understanding critical software systems, and on combining testing, static analysis, and machine learning to build better tools and techniques. I have been using static code analysis to identify code, developer, and process factors that affect software quality, measured in terms of bugs and design issues. By examining over 200 real-world projects of various sizes, we have shown that projects suffer from a large number of design issues and that the problem worsens over time. Simply adding more developers or resources does not solve the issue; it makes it worse. Awareness of code smells would help developers control design degradation and eventually reduce the technical debt of a project.

My most recent research explores the effectiveness of mutation analysis of programs, and especially how to make mutation analysis a workable technique for real-world developers and testers. Mutation analysis is a reasonable proxy for measuring the effectiveness of test suites, but it is also computationally expensive and time-consuming: even a moderately large software project would require millions of test suite runs. This makes mutation analysis impractical for developers and practicing testers working on real-world problems. My research focuses on how we can scale mutation analysis to large, complex software systems.

My initial research shows that we can successfully and efficiently apply mutation analysis to systems as large as the Linux kernel: by identifying equivalent and duplicate mutants, we can reduce the number of mutants, and therefore test runs, enough to make the technique scale. This effort has also identified 4 bugs in the Linux kernel. Currently I am working on combining mutant prioritization with automated mutation analysis, resulting in a tool that application developers can use with minimal effort and integrate into their development toolchain.
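
To make the redundancy-elimination idea concrete, the Python sketch below drops mutants that compile to the same object code as the original program (trivially equivalent mutants) or as an earlier mutant (duplicates), so that only one representative of each distinguishable variant needs a test-suite run. This is a minimal illustration rather than the actual kernel tooling; the gcc invocation and helper names are mine.

    import hashlib
    import subprocess
    from pathlib import Path

    def object_hash(source_file: str) -> str:
        """Compile one (possibly mutated) source file and hash its object code."""
        obj = Path(source_file).with_suffix(".o")
        subprocess.run(["gcc", "-O2", "-c", source_file, "-o", str(obj)], check=True)
        return hashlib.sha256(obj.read_bytes()).hexdigest()

    def deduplicate(original: str, mutants: list[str]) -> list[str]:
        """Keep one representative mutant per distinct compiled behavior.

        Mutants whose object code matches the original are trivially
        equivalent, and mutants that share object code with an earlier
        mutant are duplicates; neither needs its own test-suite run.
        """
        seen = {object_hash(original)}
        survivors = []
        for mutant in mutants:
            digest = object_hash(mutant)
            if digest not in seen:
                seen.add(digest)
                survivors.append(mutant)
        return survivors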


  • IBM Ph.D. Fellowship for academic year 2016-2017.
  • Graduate School Tuition Relief Scholarship for academic year 2016-2017.
  • IBM Ph.D. Fellowship for academic year 2017-2018.

Conference Publications


Free/Open Source Software developers come from a myriad of backgrounds and are driven to contribute to projects for a variety of reasons, including compensation from corporations or foundations. Motivation can have a dramatic impact on how and what contributions individuals make, as well as on how tenacious they are. These contributions may align with the needs of the developer, the community, the organization funding the developer, or all of the above. Understanding how corporate sponsorship affects the social dynamics and evolution of Free/Open Source code and community is critical to fostering healthy communities. We present a case study of corporations contributing to the Linux Kernel. We find that corporate contributors contribute more code, but are less likely to participate in non-coding activities. This knowledge will help project leaders better understand the dynamics of sponsorship and help them steer resources.
In VL/HCC, 2017.

Background: Merge conflicts are a common occurrence in software development. Researchers have shown the negative impact of conflicts on the resulting code quality and the development workflow. Thus far, no one has investigated the effect of bad design (code smells) on merge conflicts. Aims: We posit that entities that exhibit certain types of code smells are more likely to be involved in a merge conflict. We also postulate that code elements that are both “smelly” and involved in a merge conflict are associated with other undesirable effects (more likely to be buggy). Method: We mined 143 repositories from GitHub and recreated 6,979 merge conflicts to obtain metrics about code changes and conflicts. We categorized conflicts into semantic or non-semantic, based on whether changes affected the Abstract Syntax Tree. For each conflicting change, we calculate the number of code smells and the number of future bug-fixes associated with the affected lines of code. Results: We found that entities that are smelly are three times more likely to be involved in merge conflicts. Method-level code smells (Blob Operation and Internal Duplication) are highly correlated with semantic conflicts. We also found that code that is smelly and experiences merge conflicts is more likely to be buggy. Conclusion: Bad code design not only impacts maintainability, it also impacts the day to day operations of a project, such as merging contributions, and negatively impacts the quality of the resulting code. Our findings indicate that research is needed to identify better ways to support merge conflict resolution to minimize its effect on code quality.
In ESEM, 2017.

Mutation analysis is an established technique for measuring the completeness and quality of a test suite. Despite four decades of research on this technique, its use in large systems is still rare, in part due to computational requirements and high numbers of false positives. We present our experiences using mutation analysis on the Linux kernel’s RCU (Read Copy Update) module, where we adapt existing techniques to constrain the complexity and computation requirements. We show that mutation analysis can be a useful tool, uncovering gaps in even well-tested modules like RCU. This experiment has so far led to the identification of 3 gaps in the RCU test harness, and 2 bugs in the RCU module masked by those gaps. We argue that mutation testing can and should be more extensively used in practice.
In ICSTW, 2017.

Among the major questions that a practicing tester faces are deciding where to focus additional testing effort, and deciding when to stop testing. ‘Test the least-tested code, and stop when all code is well-tested’ is a reasonable answer. Many measures of ‘testedness’ have been proposed; unfortunately, we do not know whether these are truly effective. In this paper we propose a novel evaluation of two of the most important and widely-used measures of test suite quality. The first measure is statement coverage, the simplest and best-known code coverage measure. The second measure is mutation score, a supposedly more powerful, though expensive, measure. We evaluate these measures using the actual criteria of interest: if a program element is (by these measures) well tested at a given point in time, it should require fewer future bug-fixes than a ‘poorly tested’ element. If not, then it seems likely that we are not effectively measuring testedness. Using a large number of open source Java programs from GitHub and Apache, we show that both statement coverage and mutation score have only a weak negative correlation with bug-fixes. Despite the lack of strong correlation, there are statistically and practically significant differences between program elements for various binary criteria. Program elements (other than classes) covered by any test case see about half as many bug-fixes as those not covered, and a similar line can be drawn for mutation score thresholds. Our results have important implications for both software engineering practice and research evaluation.
In FSE, 2016.

Code smells are associated with poor coding practices that cause long-term maintainability problems and mask bugs. Despite mobile being a fast-growing software sector, code smells in mobile applications have been understudied. We do not know how code smells in mobile applications compare to those in desktop applications, and how code smells are affecting the design of mobile applications. Without such knowledge, application developers, tool builders, and researchers cannot improve the practice and state of the art of mobile development. We first reviewed the literature on code smells in Android applications and found that there is a significant gap between the most studied code smells in the literature and the most frequently occurring code smells in real-world applications. Inspired by this finding, we conducted a large scale empirical study to compare the type, density, and distribution of code smells in mobile vs. desktop applications. We analyze an open-source corpus of 500 Android applications (total of 6.7M LOC) and 750 desktop Java applications (total of 16M LOC), and compare 14,553 instances of code smells in Android applications to 117,557 instances of code smells in desktop applications. We find that, despite mobile applications having different structure and workflow than desktop applications, the variety and density of code smells is similar. However, the distribution of code smells is different – some code smells occur more frequently in mobile applications. We also found that different categories of Android applications have different code smell distributions. We highlight several implications of our study for application developers, tool builders, and researchers.

Redundancy in mutants, where multiple mutants end up producing the same semantic variant of a program, is a major problem in mutation analysis. Hence, a measure of effectiveness that accounts for redundancy is an essential tool for evaluating mutation tools, new operators, and reduction techniques. Previous research suggests using the size of the disjoint mutant set as an effectiveness measure. We start from a simple premise: test suites need to be judged on both the number of unique variations in specifications they detect (as a variation measure), and also on how good they are at detecting hard-to-find faults (as a measure of thoroughness). Hence, any set of mutants should be judged by how well it supports these measurements. We show that the disjoint mutant set has two major inadequacies — the single variant assumption and the large test suite assumption — when used as a measure of effectiveness in variation. These stem from its reliance on minimal test suites. We show that when used to emulate hard-to-find bugs (as a measure of thoroughness), the disjoint mutant set discards useful mutants. We propose two alternatives: one measures variation and is not vulnerable to either the single variant assumption or the large test suite assumption; the other measures thoroughness.
In ICSTW, 2016.
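
As a rough illustration of how such a set can be derived from mutant kill data, the following Python sketch (the function and data layout are my own, not the paper's tooling) collapses mutants with identical kill sets and drops mutants subsumed by a harder-to-kill mutant.

    def disjoint_mutants(kill_matrix: dict[str, set[str]]) -> set[str]:
        """Compute a disjoint (dominator) subset of mutants from a kill matrix.

        kill_matrix maps each mutant to the set of tests that kill it.
        Mutants killed by exactly the same tests are collapsed into one
        representative, and a mutant is dropped if some other mutant is
        killed by a strict subset of its killing tests (it is subsumed
        by a harder-to-kill mutant).
        """
        distinct: dict[frozenset[str], str] = {}
        for mutant, killers in kill_matrix.items():
            if killers:  # skip unkilled (possibly equivalent) mutants
                distinct.setdefault(frozenset(killers), mutant)

        kept = set()
        for killers, mutant in distinct.items():
            if not any(other < killers for other in distinct if other != killers):
                kept.add(mutant)
        return kept

    # Example: with kills {"m1": {"t1"}, "m2": {"t1", "t2"}, "m3": {"t2"}},
    # m2 is subsumed by both m1 and m3, so the disjoint set is {"m1", "m3"}.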

Although mutation analysis is considered the best way to evaluate the effectiveness of a test suite, hefty computational cost often limits its use. To address this problem, various mutation reduction strategies have been proposed, all seeking to reduce the number of mutants while maintaining the representativeness of an exhaustive mutation analysis. While research has focused on the reduction achieved, the effectiveness of these strategies in selecting representative mutants, and the limits in doing so have not been investigated, either theoretically or empirically. We investigate the practical limits to the effectiveness of mutation reduction strategies, and provide a simple theoretical framework for thinking about the absolute limits. Our results show that the limit in improvement of effectiveness over random sampling for real-world open source programs is a mean of only 13.078%. Interestingly, there is no limit to the improvement that can be made by addition of new mutation operators. Given that this is the maximum that can be achieved with perfect advance knowledge of mutation kills, what can be practically achieved may be much worse. We conclude that more effort should be focused on enhancing mutations than removing operators in the name of selective mutation for questionable benefit.
In ICSE, 2016.

Context: Software decay is a key concern for large, long-lived software projects. Systems degrade over time as design and implementation compromises and exceptions pile up. Goal: Quantify design decay and understand how software projects deal with this issue. Method: We conducted an empirical study on the presence and evolution of code smells, used as an indicator of design degradation in 220 open source projects. Results: The best approach to maintain the quality of a project is to spend time reducing both software defects (bugs) and design issues (refactoring). We found that design issues are frequently ignored in favor of fixing defects. We also found that design issues have a higher chance of being fixed in the early stages of a project, and that efforts to correct these stall as projects mature and the code base grows, leading to a build-up of problems. Conclusions: From studying a large set of open source projects, our research suggests that while core contributors tend to fix design issues more often than non-core contributors, there is no difference once the relative quantity of commits is accounted for. We also show that design issues tend to build up over time.
In ESEM, 2015.

Mutation analysis is considered the best method for measuring the adequacy of test suites. However, the number of test runs required for a full mutation analysis grows faster than project size, making it infeasible for real-world software projects, which often have more than a million lines of code. It is for projects of this size, however, that developers most need a method for evaluating the efficacy of a test suite. Various strategies have been proposed to deal with the explosion of mutants. However, these strategies at best reduce the number of mutants required to a fraction of overall mutants, which still grows with program size. Running, e.g., 5% of all mutants of a 2MLOC program usually requires analyzing over 100,000 mutants. Similarly, while various approaches have been proposed to tackle equivalent mutants, none completely eliminate the problem, and the fraction of equivalent mutants remaining is hard to estimate, often requiring manual analysis of equivalence. In this paper, we provide both theoretical analysis and empirical evidence that a small constant sample of mutants yields statistically similar results to running a full mutation analysis, regardless of the size of the program or similarity between mutants. We show that a similar approach, using a constant sample of inputs, can estimate the degree of stubbornness in mutants remaining to a high degree of statistical confidence, and provide a mutation analysis framework for Python that incorporates the analysis of stubbornness of mutants.
In ISSRE, 2015.
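
To make the sampling intuition concrete, here is a small Python sketch (mine, not the paper's framework) that estimates the overall mutation score from a fixed-size random sample; is_killed stands in for running the test suite against a single mutant, and the interval uses a simple normal approximation.

    import math
    import random

    def sampled_mutation_score(mutants: list, is_killed, sample_size: int = 1000, z: float = 1.96):
        """Estimate the mutation score from a fixed-size random sample.

        Runs the expensive kill check only on `sample_size` mutants chosen
        uniformly at random, and reports the sample score together with a
        normal-approximation confidence interval. The interval's width is
        governed by the sample size, not by the total number of mutants.
        """
        sample = random.sample(mutants, min(sample_size, len(mutants)))
        kills = sum(1 for mutant in sample if is_killed(mutant))
        score = kills / len(sample)
        margin = z * math.sqrt(score * (1 - score) / len(sample))
        return score, (max(0.0, score - margin), min(1.0, score + margin))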

Formal verification has advanced to the point that developers can verify the correctness of small, critical modules. Unfortunately, despite considerable efforts, determining if a “verification” verifies what the author intends is still difficult. Previous approaches are difficult to understand and often limited in applicability. Developers need verification coverage in terms of the software they are verifying, not model checking diagnostics. We propose a methodology to allow developers to determine (and correct) what it is that they have verified, and tools to support that methodology. Our basic approach is based on a novel variation of mutation analysis and the idea of verification driven by falsification. We use the CBMC model checker to show that this approach is applicable not only to simple data structures, sorting routines, and the verification of a routine in Mozilla’s JavaScript engine, but also to understanding an ongoing effort to verify the Linux kernel Read-Copy-Update (RCU) mechanism.
In ASE, 2015.

It is a widely held belief that Free/Open Source Software (FOSS) development leads to the creation of software with the same, if not higher, quality compared to that created using proprietary software development models. However, there is little research on evaluating the quality of FOSS code, and the impact of project characteristics such as age, number of core developers, code-base size, etc. In this exploratory study, we examined 110 FOSS projects, measuring the quality of the code and architectural design using code smells. We found that, contrary to our expectations, the overall quality of the code is not affected by the size of the code base, but that it was negatively impacted by the growth of the number of code contributors. Our results also show that projects with more core developers don’t necessarily have better code quality.
In OSS, 2014.

Free/Open Source Software projects often rely on users submitting bug reports. However, reports submitted by novice users may lack information critical to developers, and the process may be intimidating and difficult. To gather more and better data, projects deploy automatic crash reporting tools, which capture stack traces and memory dumps when a crash occurs. These systems potentially generate large volumes of data, which may overwhelm developers, and their presence may discourage users from submitting traditional bug reports. In this paper, we examine Mozilla’s automatic crash reporting system and how it affects their bug triaging process. We find that fewer than 0.00009% of crash reports end up in a bug report, but as many as 2.33% of bug reports have data from crash reports added. Feedback from developers shows that despite some problems, these systems are valuable. We conclude with a discussion of the pros and cons of automatic crash reporting systems.
In OpenSym, 2014.

Journal Publications

Mutation analysis is a well-known yet unfortunately costly method for measuring test suite quality. Researchers have proposed numerous mutation reduction strategies in order to reduce the high cost of mutation analysis, while preserving the representativeness of the original set of mutants. As mutation reduction is an area of active research, it is important to understand the limits of possible improvements. We theoretically and empirically investigate the limits of improvement in effectiveness from using mutation reduction strategies compared to random sampling. Using real-world open source programs as subjects, we find an absolute limit in improvement of effectiveness over random sampling of 13.078%. Given our findings with respect to absolute limits, one may ask: how effective are the extant mutation reduction strategies? We evaluate the effectiveness of multiple mutation reduction strategies in comparison to random sampling. We find that none of the mutation reduction strategies evaluated (many forms of operator selection, and stratified sampling on operators or program elements) produced an effectiveness advantage larger than 5% in comparison with random sampling. Given the poor performance of mutation selection strategies, which may have a negligible advantage at best and often perform worse than random sampling, we caution practicing testers against applying mutation reduction strategies without adequate justification.
In IEEE Transactions on Reliability, 2017.

Although mutation analysis is the primary means of evaluating the quality of test suites, it suffers from inadequate standardization. Mutation analysis tools vary based on language, when mutants are generated (phase of compilation), and target audience. Mutation tools rarely implement the complete set of operators proposed in the literature, and most implement at least a few domain-specific mutation operators. Thus different tools may not always agree on the mutant kills of a test suite, and few criteria exist to guide a practitioner in choosing a tool, or a researcher in comparing previous results. We investigate an ensemble of measures such as traditional difficulty of detection, strength of minimal sets, diversity of mutants, as well as the information carried by the mutants produced, to evaluate the efficacy of mutant sets. By these measures, mutation tools rarely agree, often with large differences, and the variation due to project, even after accounting for differences due to test suites, is significant. However, the mean difference between tools is very small, indicating that no single tool consistently skews mutation scores high or low for all projects. These results suggest that research using a single tool, a small number of projects, or small increments in mutation score may not yield reliable results. There is a clear need for greater standardization of mutation analysis; we propose one approach for such a standardization.
In SQJ, 2016.

Technical Reports

Does the Choice of Mutation Tool Matter? In Technical Report, 2015.


An Empirical Comparison of Mutant Selection Approaches. In Technical Report, 2014.


Invited Talks

  • How to apply mutation testing to the RCU for fun and profit. Linux Plumbers Conference 2015

  • How to apply mutation testing to the RCU for fun and profit: A progress report. Linux Plumbers Conference 2016


  • Reviewer, IEEE Transactions on Reliability, 2017
  • Reviewer, Software Testing, Verification and Reliability Journal, 2017
  • Reviewer, IEEE/ACM International Conference on Automated Software Engineering (ASE), 2017
  • Reviewer, IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), 2017


Large-Scale Mutation Analysis

Improving the reliability of the Linux kernel by applying mutation analysis


Teaching Experience

Instructor, Computer Science Department, Oregon State University

  • CS361: Software Engineering I
  • CS362: Software Engineering II
  • CS275: Introduction to Databases

Teaching Assistant, Computer Science Department, Oregon State University

  • CS440: Database Management Systems
  • CS275: Introduction to Databases
  • CS361: Software Engineering I
  • CS362: Software Engineering II

Industry Experience

Senior Executive, Operations/Technology, Grameenphone Ltd., Dhaka, Bangladesh

  • Duration: May 2010-August 2011
  • Major projects:
  • Developed a distributed mobile commerce ticketing solution using JSP and Oracle on a UMB interface.
  • Developed a web-based mobile airtime recharge application.
  • Performed end-to-end system requirements analysis and finalization of mobile commerce solutions.
  • Developed a website scraping and data extraction tool using PHP.

System Engineer, IT Operations, Grameenphone Ltd., Dhaka, Bangladesh

  • Duration: June 2007-May 2008
  • Major projects:
  • Developed an issue tracking system to provide efficient support for IT helpdesk engineers.