Files are identical



















In my opinion, this is a file-system operation. So first, choose your filesystem with care. Next, deduplicate. Then compare inodes. An MD5 hash would be faster than a full comparison but slower than a plain CRC check. You have to figure out what kind of reliability you want from the comparison.
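As a rough illustration of the inode idea (a sketch, not something from the thread): hard-linked copies share a device and inode number, so a stat comparison proves identity without reading any data, and only after that does the checksum-strength trade-off matter. The GNU coreutils stat -c format, the cksum/md5sum tools, and the file names are all assumptions here.

    f1=src/A.java f2=src/B.java          # hypothetical file names

    # Same device and inode number: both names refer to the same data on disk.
    if [ "$(stat -c '%d:%i' "$f1")" = "$(stat -c '%d:%i' "$f2")" ]; then
        echo "same inode: identical, no data read"
    else
        cksum  "$f1" "$f2"               # plain CRC: fast, weaker guarantee
        md5sum "$f1" "$f2"               # MD5: slower, much stronger guarantee
    fi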

What is the fastest way to check if files are identical? Asked 12 years, 8 months ago. Active 1 year, 10 months ago.

Update: I know about generating checksums. I want speed.

Update: Is this ANY different than "the fastest way to compare two files"?

If you just want to know if two files are identical, use cmp. You mention that they are Java files; do you need a tool that can also ignore whitespace and formatting differences?
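For what it's worth, a minimal cmp check might look like the sketch below (the file names are placeholders). cmp is POSIX; with -s it prints nothing, reports the result through its exit status, and stops at the first differing byte.

    if cmp -s First.java Second.java; then
        echo "identical"
    else
        echo "different"
    fi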

Let's say you run a program a million times and you want to compare the million different outputs.

OP modification: lifted up an important comment from Mark Bessey: "another obvious optimization, if the files are expected to be mostly identical and if they're relatively small, is to keep one of the files entirely in memory."
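A sketch of that scenario, under the assumption that the outputs are named out.* in one directory: compare every output against one arbitrarily chosen reference. cmp stops at the first differing byte, and the reference file quickly ends up in the OS cache, which is in the spirit of the comment above.

    set -- out.*                     # assumed naming of the million outputs
    ref=$1                           # pick any one output as the reference
    for f in "$@"; do
        cmp -s "$ref" "$f" || echo "differs from $ref: $f"
    done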

I think compare has a setting to give up once a difference is found. I wonder whether the file system could give you cheap checksums if one used something like ZFS. There's no faster way to prove that two files are identical than comparing them byte for byte (duh), but for typical files there are probabilistically faster ways to prove that they are not the same.

Sampling the beginning and end of the files first would find differences faster if the files are mostly the same but differ in only one part, usually the beginning or the end. I provided a link to the Knuth-Morris-Pratt method of solving this exact problem (yes, Donald Knuth) a while back, but apparently people only read the top answers, because this obviously better answer got its first upvote today.
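A sketch of that sampling idea, assuming bash (for process substitution) and GNU cmp (for -n); the 64 KiB window and the f1/f2 variables are arbitrary choices.

    # f1 and f2 hold the two file names; n is the number of bytes sampled at each end.
    n=65536
    if ! cmp -s -n "$n" "$f1" "$f2"; then
        echo "differ within the first $n bytes"
    elif ! cmp -s <(tail -c "$n" "$f1") <(tail -c "$n" "$f2"); then
        echo "differ within the last $n bytes"
    elif cmp -s "$f1" "$f2"; then
        echo "identical"
    else
        echo "differ somewhere in the middle"
    fi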

Doug, that doesn't make sense; we still need to calculate the checksum for all one million files.

Calculating the checksum would take much more time than a direct comparison. You can pick a single file against which all n-1 other files are compared, so you only need to read, at most, all n files to the end, and only if they are in fact all identical. If the OP had asked which files are identical, I'd have calculated checksums and sorted them to group files with the same checksum.
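For that "which files are identical" variant, one hedged way to do the checksum-and-group step (md5sum and awk assumed; file names without spaces assumed):

    md5sum out.* | sort | awk '
        { names[$1] = names[$1] " " $2; count[$1]++ }
        END { for (h in count) if (count[h] > 1) print "same checksum:" names[h] }
    '

Files that land in the same group are only candidates; a confirming cmp between them settles it if hash collisions are a concern.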

Checking that the file sizes are the same first is exactly the kind of thing that's so obvious it's easy to forget to do. Thanks!
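A sketch of that cheap pre-check, assuming GNU stat (-c %s prints the size in bytes; on BSD/macOS the rough equivalent is stat -f %z):

    # f1 and f2 hold the two file names being compared.
    if [ "$(stat -c %s "$f1")" -ne "$(stat -c %s "$f2")" ]; then
        echo "different: sizes differ, no data read"
    elif cmp -s "$f1" "$f2"; then
        echo "identical"
    else
        echo "different"
    fi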

Peter Wone: EDIT: Some people have expressed surprise and even doubt that it could be faster to read the files twice than to read them just once. However, this has at best marginal relevance to the original question. I was a bit annoyed, actually. Even if the files were cached, it must take longer to read all the bytes to hash them and then reprocess them to compare than it would to just read and compare them in the first place.

As well, the comparison can abort on the first mismatch. Therefore, your test was faulty, or your measurement of the time taken was. Given two files of a million bytes, doing a million iterations of "if chr1 != chr2" ...

Thomas: What buffer size did you use for reading the files for the comparison?

To the first two comments above by Software Monkey: remember the caching. Reading two files sequentially into memory from the physical disk can be faster than reading them both in parallel, alternating between them and moving the read head back and forth.

Everything you do later, with all the data cached in memory, is relatively much faster. But yes, it depends on the data, and this is an average: two files that actually do differ at the beginning will be faster to compare byte by byte.

Do you really have to read every byte of every file? Can you stop reading if one file falls out of favor? For identical files, you have to read the lot even for cmp.

Well, the optimal algorithm will depend on the number of duplicate files.

Why not put a check in the script to see if the file name is already all lower case and, if so, skip it? That makes the most sense to me.

First, using dd to do case conversion is using a sledgehammer to swat flies; use the shell's typeset -l facility or tr "[A-Z]" "[a-z]". Next, test whether the lower-cased name is actually different. Finally, before you mv, check whether the target file already exists so that you don't destroy data.

Instead of using mv and dd, try a script along the lines sketched below. Rgds
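A minimal sketch of such a script, following the checks suggested above (skip names that are already lower case, lower-case with tr rather than dd, and never overwrite an existing file); the *.java glob is an assumption, and typeset -l would work just as well in shells that have it.

    #!/bin/sh
    # Sketch: rename files to lower-case names.
    for f in *.java; do
        lower=$(printf '%s\n' "$f" | tr "[A-Z]" "[a-z]")

        # Skip names that are already all lower case.
        [ "$lower" = "$f" ] && continue

        # Don't destroy data: refuse to overwrite an existing file.
        if [ -e "$lower" ]; then
            echo "skipping $f: $lower already exists" >&2
            continue
        fi

        mv -- "$f" "$lower"
    done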


