Numerical string sorting in Clojure

Time for a bit of functional programming nostalgia. As Wikipedia mentions – http://en.wikipedia.org/wiki/Collation#Issues_with_numbers and hundreds of posts from interested people all over the Internet confirm – it is sometimes useful to have strings sorted using their numerical components.

The simple algorithm below handles this task as it seems most reasonable to me, namely putting strings with numbers first and among them sorting according to consecutive numbers. Strings are also sorted alphabetically first to preserve alphabetical order between numerically equal strings.

Notice the use of group-by to sort strings within groups containing the same number of numerical components. This is necessary to ensure the correct ordering!

The algorithm hasn’t been optimized for performance yet, although executing it from REPL on 2800 entries yields 66ms execution time on my mid-2013 maxed-out 13″ MacBook Air. Not too bad taking into account all those collection manipulations. Going one order of magnitude higher, to 28000 strings, yields 3364ms. So the complexity seems quadratic sorry. But still should be acceptable for most uses, like sorting filenames.

UPDATE: After replacing nth with get, and adding vec to the composition it’s 300ms now for 28000 strings. nth is evil! For educational note, nth is actually also performing well on vectors but get returns nil on non-vectors and this helped me identify that result of sort is a sequence which was ruining everything.

(ns eu.algoholic.smart_sort)

(def n_lst_cmp (comp
	(fn [a] (if (empty? a) false (< (first (first a)) (second (first a)))))
	(fn [a] (drop-while #(== (first %) (second %)) a))
	(fn [a,b] (map vector a b))
))

(def smart_sort (comp
	(fn [a] (sort n_lst_cmp a))
	(fn [a] (reduce concat (map second (sort a))))
	(fn [a] (zipmap (map #(if (== % 0) Integer/MAX_VALUE %) (keys a)) (map #(sort n_lst_cmp %) (vals a))))
	(fn [a] (group-by count a))
	(fn [a] (map-indexed (fn [b,c] (with-meta (map #(Integer/parseInt %) c) {:path (get a b) :idx b})) (map #(re-seq #"[0-9]+" %) a)))
	vec
	sort
))

(defn get_meta [a,k] (map #(get (meta %) k) a))

For this input:

(def s ‘[“ala_10_8_6” “ma_9_4_4” “kota_8_3_7” “a_4_5_8” “kot_7_10_0” “ma_4_8_2” “ale_2_4_9” “ale_10_7_1” “ala_0_7” “ma_9_7” “aids_8_10” “kota” “i_3_9” “nie_3_1” “chce_3_6” “si_8_9” “ala” “leczyc_6” “lorem_2” “ipsum_2” “dolor_1” “sit_0” “amet_1” “adpiscin_3” “elit_2” “piscina_10” “uomini_9” “ma”])

with the following code:

(require '[eu.algoholic.smart_sort :as ss])
(def w (ss/smart_sort s))
(ss/get_meta w :path)

we get the following output:

(“sit_0” “ala_0_7” “amet_1” “dolor_1” “elit_2” “ipsum_2” “lorem_2” “ale_2_4_9” “adpiscin_3” “nie_3_1” “chce_3_6” “i_3_9” “a_4_5_8” “ma_4_8_2” “leczyc_6” “kot_7_10_0” “kota_8_3_7” “si_8_9” “aids_8_10” “uomini_9” “ma_9_4_4” “ma_9_7” “piscina_10” “ale_10_7_1” “ala_10_8_6” “ala” “kota” “ma”)

Cool, huh? Enjoy. Also, if you have an idea how to make it faster – let me know!

Leave a Reply

Your email address will not be published. Required fields are marked *