Reverse Engineering Software Systems

OVERVIEW

Maintenance costs and effort dominate a software system’s life cycle. A system’s architecture is acknowledged as a key determinant of its properties and of its successful maintenance and evolution over its lifetime. However, in architecture-centered software maintenance, empirical research and technology transfer from academia to practice have been impeded by disjoint environments, redundant efforts, the high cost of developing robust tools, and the lack of shared research infrastructure and datasets. To address these challenges, this project develops the Software Architecture INstrument (SAIN), a first-of-its-kind integration framework for assembling architecture-related techniques and tools to enable empirical research in the context of software maintenance. SAIN will deliver a tool suite comprising four principal components: (1) a cataloged library of cutting-edge tools for reverse engineering and analyzing software systems’ architectures, which increases reusability and eliminates redundant tool development across the community; (2) a plug-and-play instrument for integrating tools and techniques, which promotes interoperability of existing solutions and enables the creation of new ones; (3) reproducibility wizards that set up experiment templates, produce replication packages, and release them in formats that are easy to run and modify, which promotes wide accessibility and smooth use of existing techniques by researchers and practitioners; and (4) a public repository of software-architecture datasets and benchmarks that supports a broad range of software-architecture empirical studies.

SAIN aims to bridge the gap between academic research and practice in the software-architecture domain. On the one hand, SAIN will enable extensive empirical research by providing a large repository of architectural artifacts, including interoperable tools and benchmark datasets. Researchers will be able to compare different techniques on the same datasets to identify gaps and inaccuracies, which in turn will enable new solutions that improve the state of the art in software-architecture research. On the other hand, SAIN will provide practitioners with an authoritative source of interoperable tools and feedback, as well as a channel for contributing cutting-edge architectural artifacts. In summary, SAIN has the potential to transform software-architecture research and practice by (1) facilitating the discovery and adoption of cutting-edge techniques and tools that are best suited to modern problems and (2) ensuring architecture’s central role in a broad range of software engineering activities. SAIN will be available for public use and will foster much more effective university-industry collaboration than exists today.

Identification of Security Tactics in Code

Secure by Design has become the mainstream development approach for ensuring the security and privacy of software systems. Throughout the software development life cycle, security requirements must be addressed from the ground up with a robust architectural design perspective. Critical design decisions are often based on well-known security tactics [1], defined as reusable techniques for addressing specific quality concerns. These tactics provide solutions that enforce authentication, authorization, confidentiality, data integrity, privacy, accountability, availability, safety, and non-repudiation requirements, even when the system is under attack [2]. Architectural security tactics need to be identified in the early stages of the design process and then implemented alongside functional features [3, 4, 5, 6]. In practice, however, this is not always the case, because architectural solutions often evolve during the development process [7, 8]. Prior work shows that security breaches in many software applications are due to architectural flaws [9, 10]. Identifying security-related architectural design problems in a given software system is therefore critical to ensuring that the system is not vulnerable to cyberattacks.

Figure 1: Categorization of studied security tactics. Yellow nodes show tactic categories. Green nodes represent tactics that have at least 25 related code snippets on Stack Overflow. Gray nodes represent tactics that have fewer than 25 related code snippets and are not included in the experiments.
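The inclusion rule in the caption can be illustrated with a small tree structure. The following is a minimal sketch, not the project's implementation: the class `TacticNode`, the function `included_tactics`, the placeholder tactic names, and all snippet counts are hypothetical. Only the category names and the threshold of 25 come from the text.

```python
from dataclasses import dataclass, field
from typing import List

MIN_SNIPPETS = 25  # threshold stated in the Figure 1 caption


@dataclass
class TacticNode:
    """A node in a tactic tree: a category (has children) or a tactic (leaf)."""
    name: str
    snippet_count: int = 0  # hypothetical counts, for illustration only
    children: List["TacticNode"] = field(default_factory=list)


def included_tactics(node: TacticNode, threshold: int = MIN_SNIPPETS) -> List[str]:
    """Return the names of leaf tactics with at least `threshold` snippets."""
    if not node.children:
        return [node.name] if node.snippet_count >= threshold else []
    names: List[str] = []
    for child in node.children:
        names.extend(included_tactics(child, threshold))
    return names


# Placeholder tree: category names follow the text, tactic names and counts are invented.
root = TacticNode("Security tactics", children=[
    TacticNode("Detect", children=[TacticNode("tactic_a", 40)]),
    TacticNode("Prevent", children=[TacticNode("tactic_b", 12),
                                    TacticNode("tactic_c", 25)]),
    TacticNode("React", children=[TacticNode("tactic_d", 3)]),
])

print(included_tactics(root))
```

In this sketch a tactic with exactly 25 snippets is kept, matching the "at least 25" wording of the caption.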

We present a novel approach that leverages machine learning methods to reverse engineer the source code of a software system and pinpoint the modules that implement security tactics. Inspired by the five functions defined by the National Institute of Standards and Technology (NIST) Cybersecurity Framework [11], we generate a comprehensive list of commonly used security tactics and categorize them into branches according to whether they are used to Detect, Prevent, or React to a security incident. A security tactic tree is created by further classifying those branches into sub-branches based on commonly cited security needs: data integrity, authentication, encryption, decryption, authorization, and secure communication. Security-related and unrelated code snippets are pulled from the online question-and-answer platform Stack Overflow (http://stackoverflow.com/) and passed through a thorough review process to generate, for each security tactic, a data set that contains an equal number of related and unrelated code snippets. A set of experiments is designed and run by fine-tuning the state-of-the-art Natural Language Processing (NLP) model Bidirectional Encoder Representations from Transformers (BERT) to identify tactic-related code snippets. The experimental results indicate that code that implements security tactics can be identified with F-Measure values of up to 0.98.
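Two steps of this pipeline are easy to illustrate in isolation: building a balanced data set for one tactic and scoring predictions with the F-Measure. The sketch below rests on assumptions that the text does not confirm: balancing is done by randomly downsampling the larger class, and "F-Measure" is taken to be the F1 score (the harmonic mean of precision and recall). The function names `balance` and `f_measure` and the sample labels are hypothetical, and the BERT fine-tuning step itself is omitted.

```python
import random
from typing import List, Sequence, Tuple


def balance(related: Sequence[str], unrelated: Sequence[str],
            seed: int = 0) -> List[Tuple[str, int]]:
    """Build a labeled data set with equally many related (1) and unrelated (0) snippets.

    Assumption: the larger class is randomly downsampled to the smaller one.
    """
    n = min(len(related), len(unrelated))
    rng = random.Random(seed)
    data = [(s, 1) for s in rng.sample(list(related), n)]
    data += [(s, 0) for s in rng.sample(list(unrelated), n)]
    rng.shuffle(data)
    return data


def f_measure(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 score for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Placeholder snippets stand in for reviewed Stack Overflow code.
dataset = balance(["r1", "r2", "r3", "r4"], ["u1", "u2"])
print(len(dataset), sum(label for _, label in dataset))

# Invented labels and predictions, only to exercise the metric.
print(f_measure([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1]))
```

With a balanced data set, a classifier that always predicts one class scores at most 2/3 under this metric, which gives a rough baseline for reading the reported values.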