Geekery

[Update] – a comment left by TLP gives a much better solution to the problem that seems to be better in benchmarks as well. I’m changing the post to reflect the best and fastest method for the situation described.

In a previous post, I described a situation where we needed to remove a repeating dot in a user name. In that article, I mentioned the first site that came up when I searched Google for a solution.

I thought that I might have come across something that would be a solution for what this person was looking for. However, they clarified what they needed the function to do on their website and left me a comment with some more info:

“Thanks for citing us, but the article was about making a unique-chars string. For example, “aabbccaaaaaddee” will become “abcade”. That is what I accomplished in the article. I know regular expressions very well, and there are some ways of accomplishing that, but not in one step. Every char of the string must be separated by a separator (e.g. comma, pipe, etc.) and then the lookback regular expression applied.

The beauty of using regular expressions is this: a lot of the steps needed to accomplish some string formatting/parsing can be done in one step using RegExp. But if there is the same number of steps, it will be faster not to use RegExp.”

I love programming and I definitely love a challenge and so, webdev.andrei: I accept your challenge.

$str = 'aabbccaaaaaddee';
echo preg_replace('{(.)\1+}','$1',$str);
//abcade

This simple expression comes courtesy of a commenter named TLP. Let’s break it down to see how it works. The curly braces around {(.)\1+} are just the pattern delimiters (PHP accepts a matching pair of brackets in place of the usual slashes); they are not part of the expression itself. The dot matches any single character (except a newline), and the round brackets around it form a capturing group that remembers whichever character was matched. The \1 is a backreference: it lets the expression reuse the text captured by the first group, so \1+ matches one or more further copies of that same character. Put together, (.)\1+ matches any run of two or more identical characters, and preg_replace swaps each run for $1, the single captured character.
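To make the pieces concrete, here is a minimal sketch that wraps the same expression in a function. The name collapse_runs is hypothetical, and the u and s flags are my own additions (assuming UTF-8 input, and that newlines should be treated like any other character); they are not part of TLP’s original pattern.

```php
<?php
// Hypothetical helper (not from the original post): collapse each run of a
// repeated character down to a single character.
function collapse_runs(string $str): string
{
    // (.)  captures one character into group 1
    // \1+  matches one or more further copies of that same character
    // $1   in the replacement is the captured character, written once
    // The u flag treats the input as UTF-8; the s flag lets . match newlines.
    // preg_replace returns null on failure (e.g. invalid UTF-8), in which
    // case the input is returned unchanged.
    return preg_replace('/(.)\1+/us', '$1', $str) ?? $str;
}

echo collapse_runs('aabbccaaaaaddee'), "\n"; // abcade
echo collapse_runs('user..name'), "\n";      // user.name
```

Note that this only collapses adjacent repeats: the second group of “a”s survives because it is not next to the first.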

My original attempt, before TLP’s suggestion, was less elegant. Throughout the night and into the wee hours of the morning I was furthering my ‘regex-fu’, and I ultimately arrived at a simple loop that also does the job.

$string = 'aabbccaaaaaddee';
$new_string = '';
$starting_char = 0;
while (strlen($string) > 0 && $starting_char < strlen($string)) {
    // Match the first run of two or more letters. [A-Za-z] is used rather
    // than [A-z], which would also match the punctuation between 'Z' and 'a'.
    preg_match('/[A-Za-z]{2,}/', $string, $matches);
    // Take the character at the current position within that match.
    $letter = $matches[0][$starting_char];
    $new_string .= $letter;
    // Collapse every run of that character into a single copy.
    $regex = '/' . $letter . '{2,}/';
    $string = preg_replace($regex, $letter, $string);
    $starting_char++;
}
echo $new_string;
// abcade

In short: on each pass it takes the character at the current position, appends it to the result, and replaces every run of that character in the string with a single copy. Since that position now holds a single character, it can move on to the next position and do the same. It continues until no run of repeated characters is left in the string.

And there you have it. This could be adapted to other characters by changing the character class in the preg_match() call at the top of the loop to a different range.
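As a rough illustration of limiting the collapse to a chosen set of characters (such as the repeating dot in a user name from the earlier post), the single-expression approach can be given a character class in place of the dot. The function name collapse_runs_of and its $chars parameter are hypothetical, introduced only for this sketch.

```php
<?php
// Hypothetical helper: collapse runs only for the characters listed in $chars.
function collapse_runs_of(string $str, string $chars): string
{
    if ($chars === '') {
        return $str; // nothing to collapse; an empty class [] would be invalid
    }
    // preg_quote escapes regex metacharacters; passing '/' as the second
    // argument also escapes the pattern delimiter.
    $class = '[' . preg_quote($chars, '/') . ']';
    // Fall back to the input if preg_replace fails and returns null.
    return preg_replace('/(' . $class . ')\1+/', '$1', $str) ?? $str;
}

echo collapse_runs_of('john...doe', '.'), "\n";  // john.doe
echo collapse_runs_of('aa..bb--cc', '.-'), "\n"; // aa.bb-cc
```

Letters outside the chosen set are left alone, so a name like “aaron” would keep its double “a”.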

I hope this helps others and especially hope that webdev.andrei can get some use out of it.


  • Still not satisfied. Although you used regular expressions, there is still a loop, just as I have a loop. But your loop executes two RegExps, whilst mine executes only a simple comparison and a simple push to an array. Anyone can easily see that not using RegExp to accomplish this will be faster and easier.

    I want to mention again that “the beauty of using regular expressions is this: a lot of the steps needed to accomplish some string formatting/parsing can be done in one step using RegExp. But if there is the same number of steps, it will be faster not to use RegExp.”

  • Ahhh… I understand a bit now what you were looking for.

    I didn’t realize that you were looking to have a single regular expression do all the replacements necessary. I just assumed you were looking for a way to do the same actions using regex instead of the loop/array structure you created.

    But now I know… and knowing is half the battle.

  • TLP

    Here’s a better method using a single regular expression thanks to backreferences.

    $str = 'aabbccaaaaaddee';
    echo preg_replace('{(.)\1+}','$1',$str);
    //abcade

    I converted Andrei’s version from Actionscript to PHP to benchmark it with the other 2 versions. I timed each doing 10,000 runs.

    Admin – 1.226 seconds
    Andrei – 0.492 seconds
    Mine – 0.113 seconds

  • TLP, thanks, but this is NOT what I needed. The citation was taken out of context, and you couldn’t fully understand my problem, which is explained here: http://www.flexer.info/2008/03/11/remove-duplicate-chars-from-a-string/

    So what I needed is for a string like “aabbccaaaaaddee” to be transformed into “abcde”.

    I tried with regular expressions, but there is no easy (one-step) way.

    Thanks for the benchmarks. Great job. If you could rework your solution to accomplish this transformation (even with more steps) and run the benchmark again, that would be great.

  • To explain it even better: what I needed is for a string like “aabbccaaaaaddee” to be transformed into “abcde” (please note that there are two groups of “a”s in the first string and only one “a” in the result).

    Using “aabbccaaaaaddee” with your regular expression gives the result “abcade”, which is different from “abcde” (note the underlined letters). I need unique chars in the resulting string.

  • It seems that the comment is not showing the underlines. Please also see the comment at http://www.flexer.info/2008/03/11/remove-duplicate-chars-from-a-string/#comment-173

  • Thanks for the discussion and the help TLP. I updated the post to reflect your better version.

  • For user input I need people to be able to write ‘apple’ or ‘haaalleluja’. Therefore stripping any repetition is a bit too strict for my purpose. So I use this one to prevent chars repeated more than 3 times with any whitespace (prior to this, I strip any whitespace down to only one at a time):

    preg_replace('{( ?.)\1{4,}}','$1$1$1',$str);

    The info on this page was a great inspiration for this… Thanks! :-)

  • Actually, no, this will not work.

    The desired return string from your input (aabbccaaaaddaaeeff), would be:
    abcadaef

    The count_chars function only returns all the individual characters used without repeats and does not preserve order:
    abcdef

    Close — but no cigar.

  • Lonnie Coffman

    Using PHP this can be accomplished easily using the count_chars() function:

    $string = 'aabbccaaaaddaaeeff';
    $value = count_chars($string, 3);
    echo $value;

  • Lonnie Coffman

    Mode number 3 for the count_chars function returns a string containing all unique characters. If you run the code it will return ‘abcdef’ not ‘abcadaef’.

    You are correct that it does not preserve order though. The results are returned in alphabetical order.

  • raj

    How can I find the first duplicated character in a given string?
    E.g., for abccccdddeee
    the output should be
    c

    Please help me out.

  • Prema

    PHP String functions to eliminate repeating character

    $string = "1122337749903";
    $value = count_chars($string, 3);
    echo $value; // 0123479 (unique characters, sorted by byte value)

  • If we are using a loop with a regex, then how would it be faster than the str_replace function?
    Thanks,
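
For the variant raised in the comments, where every character should appear only once and the order of first appearance is kept (“aabbccaaaaaddee” becoming “abcde”), one possible sketch uses array functions instead of a regular expression. The function name unique_chars is hypothetical, and the sketch assumes single-byte (ASCII) input. Unlike count_chars($string, 3), which returns the unique characters sorted by byte value, this keeps them in the order they first occur.

```php
<?php
// Hypothetical helper: keep only the first occurrence of each character,
// preserving the order of first appearance. Assumes single-byte input.
function unique_chars(string $str): string
{
    // str_split breaks the string into one-byte pieces; array_unique keeps
    // the first occurrence of each value; implode joins them back together.
    return implode('', array_unique(str_split($str)));
}

echo unique_chars('aabbccaaaaaddee'), "\n"; // abcde
echo unique_chars('1122337749903'), "\n";   // 1237490
```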