10. Strings

C has integral types like char, int, long and long long, and floating-point types like float and double. However, no new data type is needed to treat a sequence of characters, which is also called a string. An array of characters or a pointer to a character can be used to represent strings. A C string is a sequence of characters stored in contiguous memory locations and terminated by a null character, written 0 or \0. (This is not the same thing as the null pointer NULL, although the terminator is often loosely called NULL.) The terminator is mandatory; without it, string functions will read past the end of the data. A C string typically has one of the following declarations:

const char str1[] = "Some string";
char str2[16] = "Some string";
const char* str3 = "Some string";
char* str4 = NULL; // Allocate and fill with characters later

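If these declarations appear inside a function, the first two strings are arrays stored on the stack. The third is a pointer to a string literal, which has static storage duration and must not be modified. The last pointer does not point to a string yet; we will typically use malloc to allocate memory for it on the heap. Since a C string is accessed through either an array or a pointer, you can use the [] operator to get characters by index.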

You can read a string from stdin, i.e. the keyboard, using the insecure gets function (removed from the C standard in C11) or the safer fgets function, as we have seen in the console I/O chapter. We can use the realloc function to read a string of arbitrary length as shown below.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stddef.h>

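int main(void)
{
  char *inf_str = (char*)calloc(16, sizeof(char));
  char *tmp = NULL;
  int c = 0;             /* int, not char, so that EOF can be told apart */
  size_t i = 0, j = 1;

  if(inf_str == NULL)
    return 1;

  while((c=getchar()) != '\n' && c != EOF) {
    if(i%16 == 0) {
      j++;
      tmp = (char*)realloc(inf_str, sizeof(char)*16*j);
      if(tmp == NULL) {  /* on failure the old block is still valid */
        free(inf_str);
        return 1;
      }
      inf_str = tmp;
    }
    inf_str[i++] = (char)c;
  }

  inf_str[i] = 0;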
  puts(inf_str);
  free(inf_str);

  return 0;
}

Note that we allocate 16 bytes at a time to avoid allocating a large chunk of memory which may be wasted. If you make the chunk too small, there will be many calls to realloc. The right value depends on how much data you expect to read and should be adjusted accordingly. We start with 16 characters and read characters from the keyboard one by one until we encounter \n. Each time i is a multiple of 16, which is determined by i%16, we grow the buffer by 16 more characters. The counter j keeps track of how many 16-byte chunks have been allocated. Because the buffer is always one chunk ahead of the characters stored so far, there is room for the terminator. Finally we put 0 at the end of the string to null-terminate it, print it and free the allocated memory.

This program works well, but it has a problem in one case. Suppose you want to read a large string and your memory is fragmented, so that no single contiguous block of the required size is available. Then you cannot read the string into memory even though the total free memory is more than the string requires. Some languages like Erlang avoid this by storing strings as linked lists rather than in one contiguous block. Also, a reallocation may have to copy the whole string to a new location, which costs O(n) time. Therefore, there is no one-size-fits-all solution for reading strings into memory.

To work with strings you must know the functions provided by the header string.h; otherwise you will end up duplicating their functionality.

10.1. Useful Functions

10.1.1. strlen and strnlen Functions

One of the most common operations is finding the length of a string, because many functions need it as input. There are two versions. strlen is slightly insecure because it depends on the \0 character: if the string is not null-terminated, strlen will continue to read past the end of the string. strnlen takes an extra argument, the maximum length of the string, and does not read beyond it. Note that strnlen is specified by POSIX and may not be available on every platform. I am giving signatures and descriptions below.

#include <string.h>

size_t strlen(const char *s);
size_t strnlen(const char *s, size_t maxlen);

The strlen() function calculates the length of the string s, excluding the terminating null byte (\0). The strlen() function returns the number of bytes in the string s.

The strnlen() function returns the number of bytes in the string pointed to by s, excluding the terminating null byte (\0), but at most maxlen. In doing this, strnlen() looks only at the first maxlen bytes at s and never beyond s+maxlen.

The strnlen() function returns strlen(s), if that is less than maxlen, or maxlen if there is no null byte (\0) among the first maxlen bytes pointed to by s.

Let us see examples as to how to use these:

#include <stdio.h>
#include <string.h>

int main()
{
  const char* str1 = "Hello";
  const char str2[] = "Universe";

  printf("Length of str1 is %zu\n", strlen(str1));
  printf("Length of str2 is %zu\n", strnlen(str2, 8));

  return 0;
}

Note the use of the conversion specifier %zu, which is the standard specifier for values of type size_t, the return type of these functions. The output is:

Length of str1 is 5
Length of str2 is 8

You can also implement strlen and strnlen yourself easily. If you define functions with the same names as library functions, do not include the header that declares them (string.h in this case); otherwise the compiler may report conflicting declarations. In real code, pick different names, because the names of standard library functions are reserved. For example, consider the following program:

#include <stdio.h>
#include <stddef.h>

size_t strlen(const char* s)
{
  size_t i=0;

  while(*s++)
    ++i;

  return i;
}

size_t strnlen(const char* s, size_t maxlen)
{
  size_t i=0;

  while((i < maxlen) && *s++)  /* test the bound first so s[maxlen] is never read */
    ++i;

  return i;
}

int main()
{
  const char* str="Hello there!";

  printf("%zu\n", strlen(str));
  printf("%zu\n", strnlen(str, 20));
  printf("%zu\n", strnlen(str, 10));

  return 0;
}

10.1.2. strcpy and strncpy Functions

Another important operation is copying one string to another. For this we have strcpy and its bounded version strncpy. You should avoid using strcpy unless you know the destination is large enough, because if the destination is smaller than the source then strcpy will write past the end of the destination, which is a security flaw. strncpy puts an additional burden on the programmer, who must pass an extra argument specifying the maximum number of bytes to copy. Let us see the synopsis and description of these functions.

#include <string.h>

char *strcpy(char *dest, const char *src);

char *strncpy(char *dest, const char *src, size_t n);

The strcpy() function copies the string pointed to by src, including the terminating null byte (\0), to the buffer pointed to by dest. The strings may not overlap, and the destination string dest must be large enough to receive the copy. Beware of buffer overruns!

The strncpy() function is similar, except that at most n bytes of src are copied. Warning: If there is no null byte among the first n bytes of src, the string placed in dest will not be null-terminated.

If the length of src is less than n, strncpy() writes additional null bytes to dest to ensure that a total of n bytes are written.

The strcpy() and strncpy() functions return a pointer to the destination string dest.

Some programmers consider strncpy() to be inefficient and error prone. If the programmer knows (i.e., includes code to test!) that the size of dest is greater than the length of src, then strcpy() can be used.

One valid (and intended) use of strncpy() is to copy a C string to a fixed-length buffer while ensuring both that the buffer is not overflowed and that unused bytes in the target buffer are zeroed out (perhaps to prevent information leaks if the buffer is to be written to media or transmitted to another process via an interprocess communication technique).

If there is no terminating null byte in the first n bytes of src, strncpy() produces an unterminated string in dest. You can force termination using something like the following:

strncpy(buf, str, n);
if (n > 0)
  buf[n - 1]= '\0';

If the destination string of a strcpy() is not large enough, then anything might happen. Overflowing fixed-length string buffers is a favorite cracker technique for taking complete control of the machine. Any time a program reads or copies data into a buffer, the program first needs to check that there’s enough space. This may be unnecessary if you can show that overflow is impossible, but be careful: programs can get changed over time, in ways that may make the impossible possible.

The best idea is to make sure the destination buffer is large enough to hold the source string and then use strcpy. The responsibility for ensuring this lies with you, the programmer. Let us see example programs for these two functions:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main()
{
  char *str1 = "Hello";
  char str2[] = "world";
  char str3[6] = {0};
  char *str4 = (char*)calloc(6, 1);

  strcpy(str3, str1);
  strncpy(str4, str2, 6);

  puts(str3);
  puts(str4);

  return 0;
}

and the output is:

Hello
world

Notice that the count passed to strncpy must include room for the terminating null byte.

You can also implement your own version of strcpy and strncpy. For example,

#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>

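char* strncpy(char* dst, const char* src, size_t n)
{
  size_t i = 0;

  /* copy at most n bytes, stopping at the null byte of src */
  for(; i < n && src[i] != '\0'; i++)
    dst[i] = src[i];

  /* pad the rest of dst with null bytes, as the library function does */
  for(; i < n; i++)
    dst[i] = '\0';

  return dst;
}

char* strcpy(char* dst, const char* src)
{
  char *ret = dst;  /* remember the start; dst is advanced below */

  while((*dst++ = *src++) != '\0');

  return ret;
}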

int main()
{
  char *str1 = "Hello";
  char str2[] = "world";
  char str3[6] = {0};
  char *str4 = (char*)calloc(6, 1);

  strcpy(str3, str1);
  strncpy(str4, str2, 6);

  puts(str3);
  puts(str4);

  return 0;
}

and the output is:

Hello
world

10.1.3. strcat and strncat Functions

Some high-level languages like C++, Java and Python let you concatenate strings with the + operator, either through operator overloading or through built-in support. C has no operator overloading, but it provides two functions, strcat and strncat, to achieve the same goal. Let us see their descriptions in the man pages.

#include <string.h>

char *strcat(char *dest, const char *src);

char *strncat(char *dest, const char *src, size_t n);

The strcat() function appends the src string to the dest string, over-writing the terminating null byte (\0) at the end of dest, and then adds a terminating null byte. The strings may not overlap, and the dest string must have enough space for the result. If dest is not large enough, program behavior is unpredictable; buffer overruns are a favorite avenue for attacking secure programs.

The strncat() function is similar, except that

it will use at most n bytes from src; and
src does not need to be null-terminated if it contains n or more bytes

As with strcat(), the resulting string in dest is always null-terminated.

If src contains n or more bytes, strncat() writes n+1 bytes to dest (n from src plus the terminating null byte). Therefore, the size of dest must be at least strlen(dest)+n+1.

Let us see an example of how to use these functions:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main()
{
  char *str1 = "Hello";
  char str2[] = "world";
  char str3[12] = {0};
  char *str4 = (char*)calloc(12, 1);

  strcat(str3, str1);
  strcat(str3, " ");
  strcat(str3, str2);

  puts(str3);

  strncat(str4, str1, strlen(str1));
  strncat(str4, " ", 1);
  strncat(str4, str2, strlen(str2));

  puts(str4);

  return 0;
}

We can implement these functions similar to previously implemented functions. For example

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

char* my_strncat(char* dst, const char*src, size_t n)
{
  size_t i=0;
  size_t dst_length;

  dst_length = strlen(dst);

  while((i<n) && *src) {
     dst[dst_length + i] = *src;
     src++;
     i++;
  }
  dst[dst_length+i] = 0;

  return dst;
}

char* my_strcat(char* dst, const char*src)
{
  size_t i=0;
  size_t dst_length;

  dst_length = strlen(dst);

  while(*src) {
    dst[dst_length + i] = *src;
    src++;
    i++;
  }
  dst[dst_length+i] = 0;

  return dst;
}

int main()
{
  char *str1 = "Hello";
  char str2[] = "world";
  char str3[12] = {0};
  char *str4 = (char*)calloc(12, 1);

  my_strcat(str3, str1);
  puts(str3);
  my_strcat(str3, " ");
  puts(str3);
  my_strcat(str3, str2);
  puts(str3);

  my_strncat(str4, str1, strlen(str1));
  puts(str4);
  my_strncat(str4, " ", 1);
  puts(str4);
  my_strncat(str4, str2, strlen(str2));
  puts(str4);

  return 0;
}

10.1.4. strcmp and strncmp Functions

Comparing two values is a very frequent requirement in programming. Integers and characters can be compared easily with the relational operators. Floats can be compared too, within the limits of floating-point accuracy. Strings, however, must be compared character by character. As with concatenation, many higher-level languages let you use the same operators for strings that are used for integers or characters, like == for equality, < for less than and so on. C once again provides functions rather than operators for comparing two strings. Let us see what the man pages say about them:

#include <string.h>

int strcmp(const char *s1, const char *s2);

int strncmp(const char *s1, const char *s2, size_t n);

The strcmp() function compares the two strings s1 and s2. It returns an integer less than, equal to, or greater than zero if s1 is found, respectively, to be less than, to match, or be greater than s2.

The strncmp() function is similar, except it compares only the first (at most) n bytes of s1 and s2.

Let us see an example program for these two functions:

#include <stdio.h>
#include <string.h>

int main()
{
  char *str1 = "Hello";
  char *str2 = "world";
  char *str3 = "Helloo";

  printf("%d\n", strcmp(str1, str1));
  printf("%d\n", strcmp(str1, str2));
  printf("%d\n", strcmp(str1, str3));

  printf("%d\n", strncmp(str1, str1, 5));
  printf("%d\n", strncmp(str2, str1, 5));
  printf("%d\n", strncmp(str1, str3, 5));

  return 0;
}

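The exact nonzero values depend on the C library, so only the signs are predictable: the three strcmp calls print zero, a negative value and a negative value, and the three strncmp calls print zero, a positive value and zero.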

Let us try to implement these functions ourselves:

#include <stdio.h>

int strncmp(const char* s1, const char* s2, size_t n)
{
  size_t i = 0;

  while((i < n) && *s1 && (*s1==*s2)) {
    s1++,s2++;
    i++;
  }

  if(i == n)  /* the first n bytes matched; do not look at byte n */
    return 0;

  return *(const unsigned char*)s1-*(const unsigned char*)s2;
}

int strcmp(const char* s1, const char* s2)
{
  while(*s1 && (*s1==*s2))
    s1++,s2++;

  return *(const unsigned char*)s1-*(const unsigned char*)s2;
}

int main()
{
  char *str1 = "Hello";
  char *str2 = "world";
  char *str3 = "Helloo";

  printf("%d\n", strcmp(str1, str1));
  printf("%d\n", strcmp(str1, str2));
  printf("%d\n", strcmp(str1, str3));

  printf("%d\n", strncmp(str1, str1, 5));
  printf("%d\n", strncmp(str2, str1, 5));
  printf("%d\n", strncmp(str1, str3, 5));

  return 0;
}

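On an ASCII system the three strcmp calls print 0, -47 and -111, because this implementation returns the difference of the first pair of differing bytes; the library versions only guarantee the sign.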

10.1.5. strstr, strchr and strrchr Functions

Many times we want to find whether a particular character is in a string or not. It is easy to do this yourself in C, but there are two functions which help with that: strchr and strrchr. At other times we want to find whether a given string is a substring of another string. This is simple to do, but most simple solutions are inefficient. C provides the function strstr for this, and C library implementations usually use a very good algorithm. I will not go into the algorithm used by the C library that accompanies gcc here; I will just describe the function and give an example. Let us see what the man pages say about these functions:

#include <string.h>

char *strstr(const char *haystack, const char *needle);

char *strchr(const char *s, int c);

char *strrchr(const char *s, int c);

The strstr() function finds the first occurrence of the substring needle in the string haystack. The terminating null bytes (\0) are not compared. It returns a pointer to the beginning of the substring, or NULL if the substring is not found.

The strchr() function returns a pointer to the first occurrence of the character c in the string s.

The strrchr() function returns a pointer to the last occurrence of the character c in the string s.

The strchr() and strrchr() functions return a pointer to the matched character or NULL if the character is not found. The terminating null byte is considered part of the string, so that if c is specified as \0, these functions return a pointer to the terminator.

Let us see an example:

#include <stdio.h>
#include <string.h>

int main()
{
  const char *str1 = "Hello";
  const char *str3 = "Helloo";

  printf("%s\n", strchr(str1, 'e'));
  printf("%p\n", (void*)strchr(str1, 'x'));   /* %p expects a void pointer */
  printf("%s\n", strchr(str1, 'l'));
  printf("%s\n", strrchr(str1, 'l'));
  printf("%s\n", strstr(str3, str1));
  printf("%p\n", (void*)strstr(str3, "xyz"));

  return 0;
}

and the output is (glibc prints a null pointer as (nil); other libraries may print it differently):

ello
(nil)
llo
lo
Helloo
(nil)

Let us try to implement these three routines ourselves. strstr is the complex one. There are many very good algorithms for it. You can find a good list of implementations here. The algorithm I present is a brute-force one and should not be used in production code; I give it only as an example. Code for the algorithms mentioned in the link is out of scope here and will be covered in the data structures and algorithms book.

#include <stdio.h>

char* strstr(const char *haystack, const char *needle)
{
  if (haystack == NULL || needle == NULL) {
    return NULL;
  }
  if (*needle == '\0') {  /* an empty needle matches at the start */
    return (char*)haystack;
  }

  for ( ; *haystack; haystack++) {
    const char *h, *n;
    for (h = haystack, n = needle; *h && *n && (*h == *n); ++h, ++n) {
    }
    if (*n == '\0') {
      return (char*)haystack;
    }
  }
  return NULL;
}

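char* strchr(const char* str, int c)
{
  /* the terminator counts as part of the string, so it is tested too */
  for( ; ; str++) {
    if(*str == (char)c)
      return (char*)str;
    if(*str == '\0')
      return NULL;
  }
}

char* strrchr(const char* str, int c)
{
  char *i = NULL;

  /* remember the most recent match, including a match on the terminator */
  for( ; ; str++) {
    if(*str == (char)c)
      i = (char*)str;
    if(*str == '\0')
      break;
  }

  return i;
}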

int main()
{
  const char *str1 = "Hello";
  const char *str3 = "Helloo";

  printf("%s\n", strchr(str1, 'e'));
  printf("%p\n", (void*)strchr(str1, 'x'));   /* %p expects a void pointer */
  printf("%s\n", strchr(str1, 'l'));
  printf("%s\n", strrchr(str1, 'l'));
  printf("%s\n", strstr(str3, str1));
  printf("%p\n", (void*)strstr(str3, "xyz"));

  return 0;
}

10.1.6. strerror Function

The strerror function maps an integer error number to an error message. Typically the value of this integer comes from errno, but strerror will accept any integer argument. A small sample program shows the messages printed.

#include <stdio.h>
#include <string.h>

int main()
{
  for(int i=0; i<135; ++i)
    printf("%d %s\n", i, strerror(i));

  return 0;
}

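and the output is shown below. The list is specific to the system (this one is from Linux with glibc), and the leading numbers are left out: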

Success
Operation not permitted
No such file or directory
No such process
Interrupted system call
Input/output error
No such device or address
Argument list too long
Exec format error
Bad file descriptor
No child processes
Resource temporarily unavailable
Cannot allocate memory
Permission denied
Bad address
Block device required
Device or resource busy
File exists
Invalid cross-device link
No such device
Not a directory
Is a directory
Invalid argument
Too many open files in system
Too many open files
Inappropriate ioctl for device
Text file busy
File too large
No space left on device
Illegal seek
Read-only file system
Too many links
Broken pipe
Numerical argument out of domain
Numerical result out of range
Resource deadlock avoided
File name too long
No locks available
Function not implemented
Directory not empty
Too many levels of symbolic links
Unknown error 41
No message of desired type
Identifier removed
Channel number out of range
Level 2 not synchronized
Level 3 halted
Level 3 reset
Link number out of range
Protocol driver not attached
No CSI structure available
Level 2 halted
Invalid exchange
Invalid request descriptor
Exchange full
No anode
Invalid request code
Invalid slot
Unknown error 58
Bad font file format
Device not a stream
No data available
Timer expired
Out of streams resources
Machine is not on the network
Package not installed
Object is remote
Link has been severed
Advertise error
Srmount error
Communication error on send
Protocol error
Multihop attempted
RFS specific error
Bad message
Value too large for defined data type
Name not unique on network
File descriptor in bad state
Remote address changed
Can not access a needed shared library
Accessing a corrupted shared library
.lib section in a.out corrupted
Attempting to link in too many shared libraries
Cannot exec a shared library directly
Invalid or incomplete multibyte or wide character
Interrupted system call should be restarted
Streams pipe error
Too many users
Socket operation on non-socket
Destination address required
Message too long
Protocol wrong type for socket
Protocol not available
Protocol not supported
Socket type not supported
Operation not supported
Protocol family not supported
Address family not supported by protocol
Address already in use
Cannot assign requested address
Network is down
Network is unreachable
Network dropped connection on reset
Software caused connection abort
Connection reset by peer
No buffer space available
Transport endpoint is already connected
Transport endpoint is not connected
Cannot send after transport endpoint shutdown
Too many references: cannot splice
Connection timed out
Connection refused
Host is down
No route to host
Operation already in progress
Operation now in progress
Stale file handle
Structure needs cleaning
Not a XENIX named type file
No XENIX semaphores available
Is a named type file
Remote I/O error
Disk quota exceeded
No medium found
Wrong medium type
Operation canceled
Required key not available
Key has expired
Key has been revoked
Key was rejected by service
Owner died
State not recoverable
Operation not possible due to RF-kill
Memory page has hardware error
Unknown error 134

10.1.7. strtok Function

There are times when we need to split a string at a set of delimiters. C provides a function called strtok for this. Note that strtok is not thread-safe, so if you need to tokenize strings in a multi-threaded program, consider using its reentrant version strtok_r, which is specified by POSIX.

#include <string.h>

char *strtok(char *str, const char *delim);

char *strtok_r(char *str, const char *delim, char **saveptr);

The strtok() function breaks a string into a sequence of zero or more nonempty tokens. On the first call to strtok() the string to be parsed should be specified in str. In each subsequent call that should parse the same string, str must be NULL.

The delim argument specifies a set of bytes that delimit the tokens in the parsed string. The caller may specify different strings in delim in successive calls that parse the same string.

Each call to strtok() returns a pointer to a null-terminated string containing the next token. This string does not include the delimiting byte. If no more tokens are found, strtok() returns NULL.

A sequence of calls to strtok() that operate on the same string maintains a pointer that determines the point from which to start searching for the next token. The first call to strtok() sets this pointer to point to the first byte of the string. The start of the next token is determined by scanning forward for the next nondelimiter byte in str. If such a byte is found, it is taken as the start of the next token. If no such byte is found, then there are no more tokens, and strtok() returns NULL. (A string that is empty or that contains only delimiters will thus cause strtok() to return NULL on the first call.)

The end of each token is found by scanning forward until either the next delimiter byte is found or until the terminating null byte (\0) is encountered. If a delimiter byte is found, it is overwritten with a null byte to terminate the current token, and strtok() saves a pointer to the following byte; that pointer will be used as the starting point when searching for the next token. In this case, strtok() returns a pointer to the start of the found token.

From the above description, it follows that a sequence of two or more contiguous delimiter bytes in the parsed string is considered to be a single delimiter, and that delimiter bytes at the start or end of the string are ignored. Put another way: the tokens returned by strtok() are always nonempty strings. Thus, for example, given the string “aaa;;bbb,”, successive calls to strtok() that specify the delimiter string “;,” would return the strings “aaa” and “bbb”, and then a NULL pointer.

The strtok_r() function is a reentrant version of strtok(). The saveptr argument is a pointer to a char * variable that is used internally by strtok_r() in order to maintain context between successive calls that parse the same string.

On the first call to strtok_r(), str should point to the string to be parsed, and the value of saveptr is ignored. In subsequent calls, str should be NULL, and saveptr should be unchanged since the previous call.

Different strings may be parsed concurrently using sequences of calls to strtok_r() that specify different saveptr arguments.

Let us see a simplified version of the example given in the man page:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char *argv[])
{
  char *str1, *token;
  int j;

  if (argc != 3) {
    fprintf(stderr, "Usage: %s string delim\n", argv[0]);
    exit(EXIT_FAILURE);
  }

  /* pass the string on the first call and NULL on every later call */
  for (j = 1, str1 = argv[1]; ; j++, str1 = NULL) {
    token = strtok(str1, argv[2]);
    if (token == NULL)
      break;
    printf("%d: %s\n", j, token);
  }

  exit(EXIT_SUCCESS);
}

and the output is:

$ ./a.out 'a/bbb///cc;xxx:yyy:' ':;'
1: a/bbb///cc
2: xxx
3: yyy

Let us try to implement strtok ourselves:

#include <stdio.h>
#include <stdlib.h>

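/* Returns nonzero if c is one of the bytes in delim. */
static int is_delim(char c, const char *delim)
{
  for ( ; *delim != '\0'; delim++)
    if (c == *delim)
      return 1;
  return 0;
}

char * strtok(char * str, const char *delim)
{
  static char *s;   /* where the next search starts */
  char *start;

  if (str != NULL)
    s = str;        /* a new string resets the saved position */
  if (s == NULL)
    return NULL;

  while (*s != '\0' && is_delim(*s, delim))   /* skip leading delimiters */
    s++;
  if (*s == '\0')
    return NULL;    /* no more tokens */

  start = s;
  while (*s != '\0' && !is_delim(*s, delim))  /* find the end of the token */
    s++;
  if (*s != '\0')
    *s++ = '\0';    /* terminate the token and step past the delimiter */

  return start;
}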

int main(int argc, char *argv[])
{
  char *str1, *token;
  int j;

  if (argc != 3) {
    fprintf(stderr, "Usage: %s string delim\n", argv[0]);
    exit(EXIT_FAILURE);
  }

  for (j = 1, str1 = argv[1]; ; j++, str1 = NULL) {
    token = strtok(str1, argv[2]);
    if (token == NULL)
      break;
    printf("%d: %s\n", j, token);
  }

  exit(EXIT_SUCCESS);
}

With this we come to an end of our discussion on strings.