How to remove duplicate lines

I am trying to write a simple program that removes duplicate lines from a file, but I am stuck. My goal is to keep exactly one copy of each duplicated line, which is different from the suggested duplicate question, so that I still have that data. I would also like the program to read from and write to the same filename. When I tried using the same filename for both input and output, I just got an empty file.

    input_file = "input.txt"
    output_file = "input.txt"

    seen_lines = set()
    outfile = open(output_file, "w")

    for line in open(input_file, "r"):
        if line not in seen_lines:
            outfile.write(line)
            seen_lines.add(line)

    outfile.close()

input.txt

    I really love christmas
    Keep the change ya filthy animal
    Pizza is my fav food
    Keep the change ya filthy animal
    Did someone say peanut butter?
    Did someone say peanut butter?
    Keep the change ya filthy animal

Expected output

    I really love christmas
    Keep the change ya filthy animal
    Pizza is my fav food
    Did someone say peanut butter?

 


The line outfile = open(output_file, "w") truncates your file no matter what else you do. The reads that follow will find an empty file. My recommendation for doing this safely is to use a temporary file:

  1. Open a temp file for writing
  2. Process the input to the new output
  3. Close both files
  4. Move the temp file to the input file name

This is much more robust than opening the file twice for reading and writing. If anything goes wrong, you will have the original and whatever work you did so far stashed away. Your current approach can mess up your file if anything goes wrong in the process.
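The truncation itself is easy to confirm in isolation. The following is only a small sketch on a throwaway file; the helper name size_after_open_for_write and the file name demo.txt are made up for the demonstration:

```python
import os
import tempfile

def size_after_open_for_write(path):
    """Open path in "w" mode, write nothing, and return the resulting size."""
    with open(path, "w"):
        pass  # merely opening in "w" mode has already emptied the file
    return os.path.getsize(path)

# Work in a temporary directory so no real file is touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.txt")  # hypothetical file name
    with open(demo_path, "w") as f:
        f.write("first line\nsecond line\n")
    size_before = os.path.getsize(demo_path)
    size_after = size_after_open_for_write(demo_path)

print(size_before, size_after)  # the second number is 0
```

Nothing was written during the second open, yet the file ends up empty. That is what happens to input.txt before the loop in the question reads a single line.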

Here is a sample using tempfile.NamedTemporaryFile, and a with block to make sure everything is closed properly, even in case of error:

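    from tempfile import NamedTemporaryFile
    from shutil import move

    input_file = "input.txt"
    output_file = "input.txt"

    seen_lines = set()

    # delete=False keeps the temp file on disk after the with block so it can be moved
    with NamedTemporaryFile('w', delete=False) as output, open(input_file) as input:
        for line in input:  # reuse the handle opened above instead of opening the file again
            sline = line.rstrip('\n')  # compare lines without the trailing newline
            if sline not in seen_lines:
                output.write(line)
                seen_lines.add(sline)
    # Both files are closed here; replace the original with the deduplicated copy
    move(output.name, output_file)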

The move at the end will work correctly even if the input and output names are the same, since output.name is guaranteed to be something different from both.
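If you need this in more than one place, the same steps can be wrapped in a function. The sketch below is a variation rather than the code above: the name dedupe_in_place is made up, and it assumes you are happy to create the temporary file in the same directory as the target and to swap it in with os.replace instead of shutil.move.

```python
import os
from tempfile import NamedTemporaryFile

def dedupe_in_place(path):
    """Rewrite path, keeping only the first occurrence of each line."""
    seen = set()
    # Assumption: the directory holding the file is writable, so the
    # temp file can live next to it on the same filesystem.
    target_dir = os.path.dirname(os.path.abspath(path))
    with NamedTemporaryFile("w", dir=target_dir, delete=False) as tmp, \
            open(path) as src:
        for line in src:
            key = line.rstrip("\n")  # compare without the trailing newline
            if key not in seen:
                tmp.write(line)
                seen.add(key)
    # Both files are closed now; swap the deduplicated copy into place.
    os.replace(tmp.name, path)
```

Calling dedupe_in_place("input.txt") on the sample file from the question should leave the four expected lines in their original order. Keeping the temp file in the same directory means the final step is a rename within one filesystem rather than a copy.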

Note also that I strip the trailing newline from each line before adding it to the set, since the last line of the file might not have one.
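To see why that matters, here is a small illustration with made-up data, where the last line repeats the first but has no trailing newline. The helper count_unique exists only for this demonstration:

```python
def count_unique(lines, strip_newline):
    """Count distinct lines, optionally ignoring the trailing newline."""
    seen = set()
    for line in lines:
        seen.add(line.rstrip("\n") if strip_newline else line)
    return len(seen)

# The last line duplicates the first, but has no "\n" at the end.
lines = ["Keep the change\n", "Pizza\n", "Keep the change"]

print(count_unique(lines, strip_newline=False))  # 3: the duplicate is missed
print(count_unique(lines, strip_newline=True))   # 2: the duplicate is caught
```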

Alt Solution

If you don't care about the order of the lines, you can simplify the process somewhat by doing everything directly in memory:

    input_file = "input.txt"
    output_file = "input.txt"

    # Read everything first; the input is closed before the file is reopened for writing
    with open(input_file) as input:
        unique = set(line.rstrip('\n') for line in input)
    with open(output_file, 'w') as output:
        for line in unique:
            output.write(line)
            output.write('\n')

You can compare this against:

    with open(input_file) as input:
        unique = set(line.rstrip('\n') for line in input.readlines())
    with open(output_file, 'w') as output:
        output.write('\n'.join(unique))

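The second version produces the same set of lines, but reads the whole file with readlines() and writes the result in a single call. One small difference: because it uses join, it does not write a newline after the last line.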
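If the order does matter but the file still fits comfortably in memory, one option is to swap the set for a dict, which remembers insertion order in Python 3.7 and later. This is only a sketch, and the function names unique_in_order and dedupe_file_in_memory are made up:

```python
def unique_in_order(lines):
    """Return the distinct lines, first occurrences first, without newlines."""
    # dict.fromkeys keeps one key per distinct line, in first-seen order.
    return list(dict.fromkeys(line.rstrip("\n") for line in lines))

def dedupe_file_in_memory(path):
    """Read path fully, then overwrite it with its unique lines in order."""
    with open(path) as src:
        unique = unique_in_order(src)
    # The input is fully read and closed before the file is truncated.
    with open(path, "w") as dst:
        for line in unique:
            dst.write(line + "\n")
```

Unlike the temporary-file approach, this overwrites the file directly, so a failure in the middle of the write can still leave it incomplete.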
