Strings considered harmful

Introduction

With our Inline service we support Agile development teams in building great and secure applications. A large part of the work is performing code reviews to catch bugs or flaws with security ramifications (see this great blogpost outlining the differences) before they reach production.

The code reviews provide a great opportunity to catch issues early, but also present a great opportunity to advise software engineers on how and when to apply security best-practices. From our experience these interactions really help the team in becoming more security aware.

The most well-known security best-practices we often preach are input validation and contextual output encoding. Many frameworks provide support to aid in both tasks, but today I would like to discuss how the choice of types to represent data can help or hurt your application in the case of input validation and contextual output encoding.

To help the narrative I would like to limit my examples to a RESTful service written in Java. However, the issues discussed apply to any application that processes user-supplied input and is written in a programming language with a static type system (e.g., C#, Scala, Go, Kotlin, etc.). In Java, strings are represented by the String class and can be constructed using the StringBuilder class.

As the title of this blogpost reveals, strings and why we consider their use harmful are central to the discussion. We consider their use harmful for input validation because they violate two properties we like for validated data:

  1. validation should happen in one place to prevent divergence, and
  2. once validated, we should have a strong guarantee in future uses that we are using validated data.

The harmful use of strings in the case of contextual output encoding will be clarified with examples.

Input validation

Input validation is the first measure that, when performed correctly, protects your application and the other services it calls from malicious users. If your application uses user-supplied input to, for example, build an SQL query without using parameterized statements, input validation can save the day. We have encountered many applications with security issues that we could not exploit because of proper input validation.

Input validation should happen as early as possible, which in the case of a RESTful service is the controller class handling a request. The following listing shows an example of a controller with a method handling a request, as is commonly encountered in our code reviews.

@Path("/employee/{dept}/{id}")
public class EmployeeController {
	@GET
	@Produces("application/json;charset=UTF-8")
	public Employee getEmployee(
	@PathParam("dept") String department,
	@PathParam("id") String id) {
		/* some processing */
	}
}

This example shows the use of strings to hold the department and the id of the employee we are looking for. Validation can take place either in the method body, for example by calling a validation function, or through Bean validation, which adds validation based on annotations.

Both implementations have their pros and cons and can successfully thwart an attack by a malicious user. However, every other method that subsequently receives the string should validate it again before use, because the string cannot be guaranteed to be valid! Unfortunately the string type only guarantees that the content is, in the case of Java, a UTF-16 encoded string. Any path to vulnerable code that misses the validation will result in an exploitable vulnerability.

So how can we improve?

In the case of Java (or any other statically typed language) we advise using the type system to ensure the validity of user-supplied input. That is, create your own types that represent the information and that perform validation when being constructed from a string. Using a class to represent information instead of using a string has two benefits:

  • Validation is encapsulated, doesn't diverge (e.g., into multiple validation routines with slightly different checks), and is easily maintainable.
  • Once validated, the type will guarantee valid data in future uses.

These classes can be easily shared among applications that have to deal with the same data. The following example shows how this would look with Bean validation on user-defined types:

 class Department {
	@NotEmpty
	@Pattern(regexp="^DEP-\\d{5}$")
	private String dept;
	/* omitted getters and setters */
}
	
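class Email {
	@NotEmpty
	/* fully qualified because the simple name Email refers to this class; assumes the Bean Validation 2.0 constraint */
	@javax.validation.constraints.Email
	private String email;
	/* omitted getters and setters */
}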
	
@Path("/employee/{dept}/{id}")
public class EmployeeController {
	@GET
	@Produces("application/json;charset=UTF-8")
	public Employee getEmployee(
	@PathParam("dept") @Valid Department department,
	@PathParam("id") @Valid Email id) {
		/* any subsequent code using these instances gets the guarantee that the supplied input is validated! */
	}
}

The downside of this approach is that validation has to be invoked externally (in the above example via the @Valid annotation, which tells the validation framework to process the validation annotations). This means that execution paths that manually construct a Department or Email object can still end up with unvalidated data!

Compared to using a String this is only a minor improvement, namely the encapsulation of validation. We still lack a strong guarantee that an object contains validated data. That guarantee can be obtained by validating the object programmatically: delegating object creation to a factory method that invokes the validation ensures that the data is always validated.

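class Email {
	@NotEmpty
	/* fully qualified because the simple name Email refers to this class; assumes the Bean Validation 2.0 constraint */
	@javax.validation.constraints.Email
	private String email;
	private Email(String email) { this.email = email; }
	public static Email fromString(String email) {
		Email e = new Email(email);
		/* This is expensive, should be cached somewhere */
		ValidatorFactory factory = Validation.buildDefaultValidatorFactory();
		Validator validator = factory.getValidator();
		Set<ConstraintViolation<Email>> violations = validator.validate(e);
		factory.close();
		/* reject invalid input so that no unvalidated instance can escape the factory method */
		if (!violations.isEmpty()) {
			throw new IllegalArgumentException("Invalid email");
		}
		return e;
	}
}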

The following example shows how this would look with user-defined types that perform validation.

class Department {
	private String dept;
	public Department(String dept) {
		if (isValid(dept)) {
			this.dept = dept;
		} else {
			throw new IllegalArgumentException("Invalid department");
		}
	}
	boolean isValid(String dept) {
		/* details omitted */
	}
}
	
class Email {
	private String email;
	public Email(String email) {
		if (isValid(email)) {
			this.email = email;
		} else {
			throw new IllegalArgumentException("Invalid email");
		}
	}
	boolean isValid(String email) {
		/* details omitted */
	}
}
	
@Path("/employee/{dept}/{id}")
public class EmployeeController {
	@GET
	@Produces("application/json;charset=UTF-8")
	public Employee getEmployee(
	@PathParam("dept") Department department,
	@PathParam("id") Email id) {
		/* any subsequent code using these classes gets the guarantee that the supplied input is validated! */
	}
}

This approach both encapsulates the validation and, by validating on construction, ensures that every instance holds valid data.
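As a minimal, JDK-only sketch of such a type, the following fills in one possible validation for the department, reusing the DEP-\d{5} pattern from the Bean validation example. The names fromString, value and EmployeeRepository are assumptions made for illustration and are not part of the listings above. The field is final and the constructor private, so an instance can neither be created nor changed without passing validation.

```java
import java.util.regex.Pattern;

final class Department {
    // Same format as the Bean validation example: "DEP-" followed by five digits.
    private static final Pattern FORMAT = Pattern.compile("DEP-\\d{5}");

    private final String value;

    private Department(String value) {
        this.value = value;
    }

    // The only way to obtain an instance; rejects null and malformed input.
    static Department fromString(String raw) {
        if (raw == null || !FORMAT.matcher(raw).matches()) {
            throw new IllegalArgumentException("Invalid department");
        }
        return new Department(raw);
    }

    String value() {
        return value;
    }
}

// Hypothetical downstream code: the parameter type documents and enforces
// that only validated input can arrive here.
final class EmployeeRepository {
    static String describe(Department department) {
        return "employees of " + department.value();
    }
}
```

A method such as describe no longer has to validate again: a caller holding only a String has to go through Department.fromString first, and that call throws for anything that does not match the format.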

Contextual output encoding

With input validation covered, we turn to contextual output encoding. Even with strong input validation we can encounter situations where validated data can be used to exploit your application: a character that is allowed and passes validation can still be abused in some context to manipulate the application. Validation alone is therefore not sufficient. The most well-known example is SQL injection. SQL queries are commonly constructed from strings, which in Java looks somewhat like this:

StringBuilder sb = new StringBuilder();
sb.append("SELECT * FROM employees WHERE lastname = '");
sb.append(lastname);
sb.append("'");
	
String query = sb.toString();
/* some details omitted */
ResultSet rs = stmt.executeQuery(query);

Not properly encoding data in the context of its use, in this case SQL, opens up the application to attacks. In the example above an attacker can perform an automated SQL injection attack, using for example sqlmap, and disclose data from the database (or worse, compromise the database server).

So for non-dynamic queries the StringBuilder usually signals that something is going wrong. I do recognize that, while SQL injection is one of the most well-known examples of output encoding going wrong, it doesn't really show how strings can be harmful, because prepared statements are still constructed using strings. But consider the following examples:

XML construction using StringBuilder

StringBuilder sb = new StringBuilder();
sb.append("<employee>");
sb.append("<lastname>");
sb.append(lastname);
sb.append("</lastname>");
sb.append("</employee>");
String xml = sb.toString();

JSON construction using StringBuilder

StringBuilder sb = new StringBuilder();
sb.append("{\"employee\": {");
sb.append("\"lastname\":\"");
sb.append(lastname);
sb.append("\"}}");
String json = sb.toString();

URL construction using StringBuilder

StringBuilder sb = new StringBuilder();
sb.append("https://internal-employee-service/search?lastname=");
sb.append(lastname);
String url = sb.toString();

URL construction in particular is something we encounter often, and it opens up the application to HTTP parameter pollution attacks if done as in the example. The correct way is to use a type that represents a URL and automatically encodes the correct parts. For the URL example this would be the class UriBuilder.

/* UriBuilder is abstract, so an instance is obtained through a static factory method */
URL url = UriBuilder.fromUri("https://internal-employee-service/search")
	.queryParam("lastname", lastname)
	.build()
	.toURL();
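To make the parameter pollution risk concrete, here is a small JDK-only sketch that compares the concatenated URL with one whose value is encoded as a query component. The class SearchUrl and the injected role parameter are hypothetical, and the URLEncoder overload that takes a Charset assumes Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class SearchUrl {
    private static final String BASE = "https://internal-employee-service/search";

    // Plain concatenation, equivalent to the StringBuilder example.
    static String naive(String lastname) {
        return BASE + "?lastname=" + lastname;
    }

    // The value is percent-encoded, so '&' and '=' lose their special meaning.
    static String encoded(String lastname) {
        return BASE + "?lastname=" + URLEncoder.encode(lastname, StandardCharsets.UTF_8);
    }
}
```

For the input smith&role=admin, naive produces a URL with a second, attacker-controlled parameter (?lastname=smith&role=admin), whereas encoded keeps everything inside the lastname value (?lastname=smith%26role%3Dadmin).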

All of this can be achieved with strings and encoding routines used in the correct context, but that leaves it up to the software engineer to apply the correct encoding every time. In our experience this increases the likelihood of a mistake leading to a vulnerable application. It also makes the job of code reviewers harder, because we need to verify each time whether the correct encoding is applied.
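The same idea applies to the XML example. As a sketch, assuming the XMLStreamWriter that ships with the JDK is acceptable for the application, the document can be built through a type that knows which parts are markup and which are character data. The class name EmployeeXml is hypothetical.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

final class EmployeeXml {
    // Builds <employee><lastname>...</lastname></employee>; the writer escapes
    // markup characters in the text content, the caller does not have to.
    static String toXml(String lastname) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            writer.writeStartElement("employee");
            writer.writeStartElement("lastname");
            writer.writeCharacters(lastname); // written as character data, never as markup
            writer.writeEndElement();         // closes lastname
            writer.writeEndElement();         // closes employee
            writer.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not serialize employee", e);
        }
        return out.toString();
    }
}
```

A lastname such as </lastname><admin>true</admin> then ends up as escaped text inside the lastname element instead of adding an admin element to the document.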

Conclusion

We hope this blogpost has shown you why we consider strings harmful and how using types can help protect your application from common attacks. It also makes your application easier to review for security engineers, which in the end should result in great and secure software!
