I have a huge backup of my blog posts. All posts have images like:

"http://www.mysite.com/nonono-nonono.jpg"

or

"http://www.mysite.com/nonono-nonono.gif"

or even

"http://www.mysite.com/nonono.jpg"

But I have other links to URLs on the same domain, like "http://www.mysite.com/category/post.html", and I just want to replace the URLs of the images (luckily, all images are at the root of the website).

Do I need to learn regular expressions to do that? Is there a powerful tool for finding and replacing text like this? Thanks.

+1  A: 

Regular expressions will be your best bet... maybe something like this (based on the one from strfriend)?

^((ht|f)tp(s?)\:\/\/|~/|/)?([\w]+:\w+@)?([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?((/?\w+/)+|/?)(\w+\.(jpg|gif|png))?
Steve Losh
A: 

Regular expressions are certainly one way to do it, and probably the most flexible. But if all of your image URLs start with "http://www.mysite.com/" and end with ".jpg", then you can use string manipulation functions. For example, if you have a string variable called s that you want to test:

const string mysite = "http://www.mysite.com/";
const string jpg = ".jpg";
string newString = string.Empty;
// Only touch URLs that start with the site root...
if (s.StartsWith(mysite))
{
    // ...and end with the image extension.
    if (s.EndsWith(jpg))
    {
        // The file name is whatever sits between the prefix and the extension.
        string textToReplace = s.Substring(mysite.Length, s.Length - mysite.Length - jpg.Length);
        newString = s.Replace(textToReplace, "whatever you want to replace it with.");
    }
}

It's a rather brute-force method, but it'll work.
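
For comparison, the same prefix/suffix test can be sketched in Python. This is only an illustration under a few assumptions: the function name `is_root_image` and the extension list are made up for the example, and the extra check for a slash relies on the question's remark that all images sit at the root of the site.

```python
SITE = "http://www.mysite.com/"
IMAGE_EXTENSIONS = (".jpg", ".gif")  # assumed list; extend as needed


def is_root_image(url):
    """Return True if url points at an image directly under the site root."""
    if not url.startswith(SITE):
        return False
    path = url[len(SITE):]
    # A slash in the remainder means a subdirectory, e.g. category/post.html.
    return "/" not in path and path.lower().endswith(IMAGE_EXTENSIONS)
```

Checking for the slash keeps links such as "http://www.mysite.com/category/post.html" out of the replacement even if a page name happens to end in an image extension.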

Jim Mischel
A: 

To replace all filenames by 'new_image_name_here' in image urls:

$ perl -pe's~(http://.*?/)[^/]+?\.(jpg|gif)\b~$1new_image_name_here.$2~g' huge_file.html > output.html

To replace a netloc part by 'www.othersite.org' in 'http://<netloc>/<image_path>':

$ perl -pe's~(?<=http://)[^/]+(?=/(?:[^/]+/)*[^/]+?\.(?:jpg|gif)\b)~www.othersite.org~g' huge_file.html > output.html

These regexes are simple, so they are easily fooled. Use more specific regexes for your input data.
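
As one illustration of what "more specific" could look like for the data in the question, here is a sketch in Python rather than Perl. It assumes the host is exactly www.mysite.com, that images live at the site root, and that file names contain only letters, digits, underscores, dots and hyphens. The names `ROOT_IMAGE` and `move_images` and the target "http://www.example.com/" are hypothetical.

```python
import re

# Matches http://www.mysite.com/<name>.jpg or .gif where <name> has no slash,
# so links such as http://www.mysite.com/category/post.html are left alone.
ROOT_IMAGE = re.compile(
    r"http://www\.mysite\.com/([\w.-]+\.(?:jpg|gif))\b",
    re.IGNORECASE,
)


def move_images(text, new_base="http://www.example.com/"):
    """Rewrite root-level image URLs to new_base, keeping the file name."""
    # A function is used as the replacement so new_base is inserted literally.
    return ROOT_IMAGE.sub(lambda m: new_base + m.group(1), text)
```

Because the host and the file-name shape are spelled out, URLs on other sites and non-image pages on the same site do not match.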

J.F. Sebastian
A: 

Honestly, I think you should learn regular expressions regardless; they're a great tool to have up your sleeve, especially in situations like this. They are extremely powerful for string manipulation. Perl is also a great language to learn at the same time, as it makes using regular expressions a breeze.

CalvinR
A: 

Thanks a lot, guys! I think regular expressions will be my choice. But I forgot to mention that I have other URLs in this big text, and when I tested this regexp it caught all of them. :( How can I be more specific, e.g. find only "www.mysite.org" between "http://" and "/no-nono-no.jpg"?

Answers are for answers. Delete this answer and add its content to your question (click edit below your question). See http://stackoverflow.com/faq and http://stackoverflow.com/questions/18557/how-does-stackoverflow-work-the-official-faq
J.F. Sebastian
+1  A: 

I'm using regular expressions in EditPad Pro. I'll also look for a good tutorial for beginners. Thanks for the tip, @CalvinR.

You can find a good regex tutorial right in EditPad Pro's help file.
Jan Goyvaerts
+1  A: 

It's possible with regular expressions, but I'd probably write a Python script using Beautiful Soup:

# fix_imgs.py
import sys
from bs4 import BeautifulSoup  # Beautiful Soup 4 (pip install beautifulsoup4)

for filename in sys.argv[1:]:
  with open(filename) as f:
    contents = f.read()
  soup = BeautifulSoup(contents, "html.parser")

  # replacing the src attribute of each img tag
  for img in soup.find_all('img'):
    if img.has_attr('src'):
      img['src'] = img['src'].replace("http://www.mysite.com", "http://www.example.com")

  new_contents = str(soup)
  output_filename = "replaced." + filename
  with open(output_filename, "w") as f:
    f.write(new_contents)
orip
'giant text' and 'open(filename).read()' is a bad match.
J.F. Sebastian
@J.F.: excellent point!
orip
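
One way to address the memory concern raised in the comments is to process the backup one line at a time instead of reading it whole. A minimal sketch, assuming no URL is split across two lines; `rewrite_file` and its parameters are hypothetical names chosen for this example, and `transform` stands for whatever per-line replacement rule is actually needed.

```python
def rewrite_file(src_path, dst_path, transform, encoding="utf-8"):
    """Copy src_path to dst_path, passing each line through transform."""
    with open(src_path, encoding=encoding) as src, \
         open(dst_path, "w", encoding=encoding) as dst:
        # Iterating over a file object yields one line at a time,
        # so the whole backup is never held in memory.
        for line in src:
            dst.write(transform(line))
```

For example, `rewrite_file("huge_file.html", "output.html", lambda line: line.replace("http://www.mysite.com/nonono.jpg", "http://www.example.com/nonono.jpg"))` would rewrite one known URL; a regex-based function can be passed in the same way.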