ansaurus

Question

What is the best way to find and replace URLs in a giant text?

Answer 1

+1 A:

Regular expressions will be your best bet... maybe something like this (based on the one from strfriend)?

^((ht|f)tp(s?)\:\/\/|~/|/)?([\w]+:\w+@)?([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?((/?\w+/)+|/?)(\w+\.(jpg|gif|png))?

Steve Losh 2009-01-19 17:52:01
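
To try the pattern from Answer 1 on a block of text, a minimal Python sketch might look like the following. It assumes the leading "^" is dropped so that the pattern can match anywhere in the text rather than only at its start; the names URL_PATTERN and find_urls and the sample string are illustrative and not part of the answer.

```python
import re

# Pattern from Answer 1 with the leading "^" removed (an assumption made
# here so that it can match in the middle of a text).
URL_PATTERN = re.compile(
    r"((ht|f)tp(s?)\:\/\/|~/|/)?([\w]+:\w+@)?"
    r"([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?"
    r"((/?\w+/)+|/?)(\w+\.(jpg|gif|png))?"
)


def find_urls(text):
    """Return every substring of `text` that the pattern matches."""
    return [match.group(0) for match in URL_PATTERN.finditer(text)]


# Hypothetical sample input.
sample = "logo: http://www.mysite.com/images/logo.png end"
print(find_urls(sample))
```

The pattern is deliberately loose: the scheme is optional, so a bare name such as "file.txt" also matches. It is a starting point to narrow down, not a filter for one specific site.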

Answer 2

A:

Regular expressions are certainly one way to do it, and probably the most flexible. But if all of your image URLs start with "http://www.mysite.com/" and end with ".jpg", then you can use string manipulation functions. For example, if you have a string variable called s that you want to test:

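const string mysite = "http://www.mysite.com/";
const string jpg = ".jpg";
string newString = string.Empty;
// Only handle strings that start with the site prefix...
if (s.StartsWith(mysite))
{
    // ...and end with the image extension.
    if (s.EndsWith(jpg))
    {
        // The text between the prefix and the extension is the part to replace.
        string textToReplace = s.Substring(mysite.Length, s.Length - mysite.Length - jpg.Length);
        newString = s.Replace(textToReplace, "whatever you want to replace it with.");
    }
}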

It's a rather brute-force method, but it'll work.

Jim Mischel 2009-01-19 18:01:38
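
The same prefix/suffix idea can be sketched in Python. This is only an illustration under the assumptions stated in Answer 2 (every image URL starts with "http://www.mysite.com/" and ends with ".jpg"); the function name replace_image_path and the sample URLs are hypothetical. The sketch rebuilds the string from prefix, new text, and suffix rather than calling a replace function, so the middle text cannot accidentally be replaced elsewhere in the string.

```python
SITE_PREFIX = "http://www.mysite.com/"
IMAGE_SUFFIX = ".jpg"


def replace_image_path(url, new_path):
    """Swap the text between the site prefix and the .jpg suffix.

    Returns `url` unchanged when it lacks the prefix or the suffix.
    """
    if url.startswith(SITE_PREFIX) and url.endswith(IMAGE_SUFFIX):
        return SITE_PREFIX + new_path + IMAGE_SUFFIX
    return url


# Hypothetical inputs: one URL on the site, one on another host.
print(replace_image_path("http://www.mysite.com/old/pic.jpg", "new/pic"))
print(replace_image_path("http://other.example/pic.jpg", "new/pic"))
```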

Answer 3

A:

To replace all filenames with 'new_image_name_here' in image URLs:

$ perl -pe's~(http://.*?/)[^/]+?\.(jpg|gif)\b~$1new_image_name_here.$2~g' huge_file.html > output.html

To replace the netloc part with 'www.othersite.org' in 'http://<netloc>/<image_path>':

$ perl -pe's~(?<=http://)[^/]+(?=/(?:[^/]+/)*[^/]+?\.(?:jpg|gif)\b)~www.othersite.org~g' huge_file.html > output.html

These regexes are simple, so they are easily fooled. Use more specific regexes for your input data.

J.F. Sebastian 2009-01-19 18:08:38
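
As one possible reading of "more specific", the sketch below restricts the host replacement to a single known host and to URLs that end in an image extension. It is written in Python rather than Perl, and the host names, the pattern, and the function name replace_host are assumptions chosen for illustration.

```python
import re

OLD_HOST = "www.mysite.com"     # host to replace (illustrative value)
NEW_HOST = "www.othersite.org"  # replacement host (illustrative value)

# Match OLD_HOST only when it directly follows "http://" and is followed by
# a path that ends in .jpg or .gif. The path may not contain whitespace,
# quotes, or angle brackets, so the lookahead stays inside one URL.
IMAGE_HOST_RE = re.compile(
    r"(?<=http://)" + re.escape(OLD_HOST) + r"(?=/[^\s\"'<>]*\.(?:jpg|gif)\b)"
)


def replace_host(text):
    """Replace OLD_HOST with NEW_HOST in image URLs only."""
    return IMAGE_HOST_RE.sub(NEW_HOST, text)


# Hypothetical HTML fragment: only the first URL should change.
html = (
    '<img src="http://www.mysite.com/a/b.jpg"> '
    '<img src="http://cdn.example.net/c.jpg"> '
    '<a href="http://www.mysite.com/page.html">'
)
print(replace_host(html))
```

Because the host is passed through re.escape, its dots match literally, and URLs on other hosts or without an image extension are left alone.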

Answer 4

A:

Honestly, I think you should learn regular expressions regardless. They're a great tool to have up your sleeve, especially in situations such as this, and they are extremely powerful for string manipulation. Perl is also a great language to learn at the same time, as it makes using regular expressions a breeze.

CalvinR 2009-01-19 18:20:10

Answer 5

A:

Thanks a lot, guys! I think regular expressions will be my choice, but I forgot to mention that this big text contains other URLs too. I tested this regex and it catches all of them. :( How can I be more specific, for example matching only "www.mysite.org" between "http://" and "/no-nono-no.jpg"?

2009-01-19 18:26:47

Answers are for answers. Delete this answer and add its content to your question (click edit below your question). See http://stackoverflow.com/faq and http://stackoverflow.com/questions/18557/how-does-stackoverflow-work-the-official-faq

J.F. Sebastian 2009-01-19 18:30:37

Answer 6

+1 A:

I'm using regular expressions in EditPad Pro. I'll also look for a good tutorial for beginners. Thanks for the tip, @CalvinR.

2009-01-19 18:28:28

You can find a good regex tutorial right in EditPad Pro's help file.

Jan Goyvaerts 2009-03-04 02:03:24

Answer 7

+1 A:

It's possible with regular expressions, but I'd probably write a Python script using Beautiful Soup:

# fix_imgs.py
import sys
from bs4 import BeautifulSoup  # Beautiful Soup 4

for filename in sys.argv[1:]:
  with open(filename) as f:
    contents = f.read()
  soup = BeautifulSoup(contents, "html.parser")

  # replace the host in each img tag's src attribute
  for img in soup.find_all('img'):
    if img.get('src'):  # skip img tags that have no src
      img['src'] = img['src'].replace("http://www.mysite.com", "http://www.example.com")

  new_contents = str(soup)
  output_filename = "replaced." + filename
  with open(output_filename, "w") as f:
    f.write(new_contents)

orip 2009-01-19 18:29:04

'giant text' and 'open(filename).read()' are a bad match.

J.F. Sebastian 2009-01-19 18:56:17

@J.F.: excellent point!

orip 2009-01-19 22:42:00
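
To illustrate the point about giant files, here is a minimal Python sketch that streams the input one line at a time instead of reading it all into memory. It uses a plain regular expression rather than an HTML parser; the pattern, the default replacement host, and the function name replace_in_file are assumptions made for this example.

```python
import re

# Match the host only when it follows "http://" and is followed by a path.
HOST_RE = re.compile(r"(?<=http://)www\.mysite\.com(?=/)")


def replace_in_file(src_path, dst_path, new_host="www.example.com"):
    """Copy src_path to dst_path line by line, rewriting the host.

    Only one line is held in memory at a time.
    """
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(HOST_RE.sub(new_host, line))
```

A call such as replace_in_file("huge_file.html", "output.html") would write the rewritten copy to output.html. The line-by-line approach misses a URL that is split across two lines, and a file with no line breaks at all would still be read as one large line.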
