Discussion:
Parsing file and regexp
o***@algosyn.com
2010-02-13 13:34:45 UTC
Hello,

I need to extract information from some text files, and I want to do it
with Perl!

The file I need to parse has the following layout:

keywordA word1, word2, word3;

Here we can have some free text
...
...

keywordB word4,
word5, word6, word7, word8,
word9, word10;

keywordA
word1, word2;

...

I want to extract all the "keywords" with their associated words.
For example, with this file, I would like to have:
keywordA: (word1, word2, word3)
keywordB: (word4, word5, word6, word7, word8, word9, word10)
keywordA: (word1, word2)

Is it possible to do this with a regular expression?
Or should I write a small parser?

I have tried pattern matching with the 's' and also with the 'm'
option, but without good results...

Thanks for your help!

Olivier
Uri Guttman
2010-02-13 19:38:46 UTC
osc> keywordA word1, word2, word3;

osc> Here we can have some free text
osc> ...
osc> ...

osc> keywordB word4,
osc> word5, word6, word7, word8,
osc> word9, word10;

osc> KeywordA
osc> word1, word2;

osc> ...

how do you know when a keyword section begins or ends? how large is this
file? could free text have keywords? i see a ; to end a word list but
that isn't enough to properly parse this if you have 'free text'.

osc> I want to extract all the "keywords" with their associated words.
osc> For example, with this file, I would like to have:
osc> keywordA: (word1, word2, word3)
osc> keywordB: (word4, word5, word6, word7, word8, word9, word10)
osc> keywordA: (word1, word2)

osc> Is it possible to do this with regular expression ?
osc> Or should I write a small parser ?

yes and yes.

osc> I have tried pattern matching with the 's' and also with the 'm'
osc> option,
osc> but with no good result ...

please show your code. there is no way to help otherwise. s/// is not a
pattern matcher but a substitution operator. it uses regexes and can be
used to parse things.

uri
--
Uri Guttman ------ ***@stemsystems.com -------- http://www.sysarch.com --
----- Perl Code Review , Architecture, Development, Training, Support ------
--------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------
Olivier Scalbert
2010-02-14 09:46:36 UTC
Post by Uri Guttman
how do you know when a keyword section begins or ends? how large is this
file? could free text have keywords? i see a ; to end a word list but
that isn't enough to properly parse this if you have 'free text'.
osc> Is it possible to do this with regular expression ?
osc> Or should I write a small parser ?
yes and yes.
osc> I have tried pattern matching with the 's' and also with the 'm'
osc> option,
osc> but with no good result ...
please show your code. there is no way to help otherwise. s/// is not a
pattern matcher but a substitution operator. it uses regexes and can be
used to parse things.
uri
Hi Uri,

Sorry, the code is at my office!

The free text cannot contain keywords, and keywords start at the
beginning of a line. The list of words is terminated by a ";".
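
The three rules just stated (keywords only at the start of a line, no keywords in free text, lists closed by ";") are enough for a small line-by-line parser. The following is only a sketch under assumptions: the keyword list `@known` and the inlined sample text are hypothetical, since the thread never names the real keywords or shows the real file.

```perl
use strict;
use warnings;

# Hypothetical keyword list; the thread never says which keywords exist.
my @known = qw(keywordA keywordB);
my $kw_re = join '|', map { quotemeta($_) } @known;

# Sample input modelled on the layout in the first post (assumed, not real data).
my $text = <<'END';
keywordA word1, word2, word3;

Here we can have some free text

keywordB word4,
word5, word6, word7, word8,
word9, word10;

keywordA
word1, word2;
END

my @records;    # one [keyword, [words]] pair per occurrence, in file order
my ($current, $buffer);

for my $line (split /\n/, $text) {
    if (!defined $current) {
        # Outside a list: only a known keyword at the start of a line matters.
        next unless $line =~ /^($kw_re)\b(.*)/;
        ($current, $buffer) = ($1, $2);
    }
    else {
        # Inside a list: keep accumulating continuation lines.
        $buffer .= " $line";
    }

    # The list is complete once a semicolon has been seen.
    if ($buffer =~ /^(.*?);/) {
        my @words = grep { length } split /[\s,]+/, $1;
        push @records, [ $current, \@words ];
        undef $current;
    }
}

print "$_->[0]: (", join(', ', @{ $_->[1] }), ")\n" for @records;
```

Keeping one record per occurrence, rather than one hash entry per keyword, preserves the order and the repeated keywordA entry asked for in the first post.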

For the pattern matching I used the s option,
m/pattern/s, so that the match can span the \n characters.
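
Since the 's' and 'm' options come up several times in this thread, here is a small illustration of what each modifier changes. These are the /s and /m modifiers on a match, which are unrelated to the s/// substitution operator. The sample string is made up for the demonstration.

```perl
use strict;
use warnings;

# Made-up sample: a keyword whose word list continues on a second line.
my $text = "keyword1 word1,\nword2;\nfree text\n";

# Without /s, '.' stops at a newline, so the two-line list is missed.
my ($no_s)   = $text =~ /keyword1(.+);/;
# With /s, '.' also matches "\n", so the match can span lines.
my ($with_s) = $text =~ /keyword1(.+?);/s;

# Without /m, '^' only matches at the very start of the string.
my @no_m   = $text =~ /^(\w+)/g;
# With /m, '^' also matches right after each embedded newline.
my @with_m = $text =~ /^(\w+)/mg;

print defined $no_s ? "no /s: matched\n" : "no /s: no match\n";
print "with /s: [$with_s]\n";
print "no /m: @no_m\n";
print "with /m: @with_m\n";
```

A negated character class such as `[^;]+` matches newlines with or without /s, so /s only matters when the pattern relies on `.` to cross line boundaries.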

Olivier
Olivier Scalbert
2010-02-18 12:14:05 UTC
Post by Uri Guttman
please show your code. there is no way to help otherwise. s/// is not a
pattern matcher but a substitution operator. it uses regexes and can be
used to parse things.
uri
Here it is ...

$ cat test.txt
keyword1 word1, word2
word3;
blabla

blabla


keyword2
word4, word5,
word6, word7, word8,
word9;

bla bla
bla bla

keyword1
word10, word11;


$ cat parse.pl
use warnings;

open FILE, "< test.txt" or die "Could not open $!";
$/ = undef;
$source = <FILE>;
close(FILE);


if ($source =~ m/keyword1\s*(\w*)(,\w*)*/s) {
    print("Match !\n");
    print("$1\n");
    print("$2\n");
}

$ perl parse.pl
Match !
word1
,


Here I would like to have 2 matches:
word1, word2
word3;
and word10, word11;
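
A note on the output above: a capture group that repeats, like `(,\w*)*`, keeps only its last iteration, and `,\w*` has no room for the space that follows each comma, so `$2` ends up holding just ",". Without /g the match also stops after the first keyword1. One possible way to get both lists is sketched below; the file content is inlined so the example is self-contained, and the approach is a suggestion rather than anything proposed in the thread at this point.

```perl
use strict;
use warnings;

# Same content as test.txt in the post, inlined for a self-contained sketch.
my $source = <<'END';
keyword1 word1, word2
word3;
blabla

blabla


keyword2
word4, word5,
word6, word7, word8,
word9;

bla bla
bla bla

keyword1
word10, word11;
END

my @lists;
# /m lets ^ match at each line start; /g finds every occurrence.
# [^;]+ crosses newlines by itself, so /s is not needed here.
while ($source =~ /^keyword1\s+([^;]+);/mg) {
    # Split the captured run on any mix of commas and whitespace.
    my @words = split /[\s,]+/, $1;
    push @lists, \@words;
}

print join(', ', @$_), "\n" for @lists;
```

Capturing the whole list as one string and splitting it afterwards sidesteps the repeated-group limitation entirely.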



Thanks for your help!

Olivier
Shawn H Corey
2010-02-19 19:43:31 UTC
Post by Olivier Scalbert
$ cat test.txt
keyword1 word1, word2
word3;
blabla
blabla
keyword2
word4, word5,
word6, word7, word8,
word9;
bla bla
bla bla
keyword1
word10, word11;
#!/usr/bin/perl

use strict;
use warnings;

use Data::Dumper;

# Make Data::Dumper pretty
$Data::Dumper::Sortkeys = 1;
$Data::Dumper::Indent = 1;

# Set maximum depth for Data::Dumper, zero means unlimited
$Data::Dumper::Maxdepth = 0;

my $file = shift @ARGV;

my $source;
open my $source_fh, '<', $file or die "could not open $file: $!\n";
{
    local $/;
    $source = <$source_fh>;
}
close $source_fh;

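# Collect every "keywordN ... ;" block; [^;]+ spans newlines on its own
my %keywords;
my @captured = $source =~ m{ ( keyword\d+ ) ( [^;]+ ) \; }gmsx;
while( @captured ){
    # @captured alternates keyword, word list, keyword, word list, ...
    my $keyword = shift @captured;
    my $words = shift @captured;
    # Trim surrounding whitespace, then split on commas
    $words =~ s{ \A \s+ }{}msx;
    $words =~ s{ \s+ \z }{}msx;
    my @words = split m{ \s* \, \s* }msx, $words;
    # Words from repeated keywords are appended to the same list
    push @{ $keywords{$keyword} }, @words;
}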

print 'keywords: ', Dumper \%keywords;

__END__
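
The script above merges every occurrence of a keyword into one hash entry, so the two keyword1 lists end up combined. If each occurrence should stay separate and in file order, as in the output requested in the first post, an ordered list of records is one option. This is a hedged variation on the same regex idea, with a shortened, inlined sample in place of the real file; it additionally assumes keywords start a line, as stated earlier in the thread.

```perl
use strict;
use warnings;

# Shortened stand-in for test.txt (assumed data, not the full file).
my $source = "keyword1 word1, word2\nword3;\nblabla\n\n"
           . "keyword2\nword4, word5;\n\nbla bla\n\n"
           . "keyword1\nword10, word11;\n";

my @records;    # [keyword, [words]] pairs, one per occurrence
while ($source =~ m{ ^ (keyword\d+) \s+ ([^;]+) ; }gmx) {
    # Copy the captures before split runs another regex.
    my ($keyword, $list) = ($1, $2);
    push @records, [ $keyword, [ split /[\s,]+/, $list ] ];
}

printf "%s: (%s)\n", $_->[0], join(', ', @{ $_->[1] }) for @records;
```

An array of pairs trades the convenient lookup of a hash for preserved order and duplicates; which one fits depends on how the results are used afterwards.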
--
Just my 0.00000002 million dollars worth,
Shawn

Programming is as much about organization and communication
as it is about coding.

I like Perl; it's the only language where you can bless your
thingy.

Eliminate software piracy: use only FLOSS.