Win / TheDonald

I'm a specialist in handling large datasets without cloud stuff :D. You would be amazed what can be done on a single box. There are interpreted languages faster than Python, and I can drop down to C. I make my algos scale, so this should be fun. If you're scraping it, that'll potentially take longer than anything else.

Are you doing anything special to compare them? I'd start off with just a trimmed, case-insensitive lookup against first name, last name and DOB for a rough measure, but maybe you know a better way of dealing with the various anomalies. I'd do it multiple ways, as things like synonym lookups have lower confidence. I'd probably create a fixed combination of per-field scores and then rank each candidate on match quality. Addresses can be a PITA if you're matching those and they're just free text.
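A minimal sketch of what I mean, assuming records are dicts with hypothetical keys "first", "last" and "dob" (adjust to whatever your data actually looks like); the weights are illustrative, not tuned:

```python
import re

def norm(s):
    # trim, collapse internal whitespace, lowercase
    return re.sub(r"\s+", " ", s.strip().lower())

def match_key(rec):
    # rough blocking key: normalized first + last + DOB
    return (norm(rec["first"]), norm(rec["last"]), norm(rec["dob"]))

def match_score(a, b):
    # fixed combination of per-field scores; rank candidate pairs by this
    score = 0
    score += 2 if norm(a["first"]) == norm(b["first"]) else 0
    score += 2 if norm(a["last"]) == norm(b["last"]) else 0
    score += 3 if norm(a["dob"]) == norm(b["dob"]) else 0  # DOB is the strongest signal
    return score

a = {"first": " John ", "last": "Smith", "dob": "1970-01-01"}
b = {"first": "john", "last": "SMITH", "dob": "1970-01-01"}
print(match_key(a) == match_key(b), match_score(a, b))  # True 7
```

Lower-confidence passes (synonyms, nicknames, fuzzy matches) would feed the same scoring with smaller weights.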

Also see if you can dedupe the strings, because most of the names are the same and could just be a pointer (and often already are, so you'll have two pointers to two separate strings with the same content that could be one), though they're probably short so you might not gain much. Use a global string index and see if that works. Do it like: samestring = get(string); if samestring is None then put(string), otherwise string = samestring. Python actually ships this for strings as sys.intern, and it works in most other interpreted languages too. You can probably end up using just a few GB instead. A hash table is fine for this.

I expect your next comment to be one of astonishment, and an admission that you thought I was just blowing my own horn.

Also, don't run a full pass the first time. Sample ten thousand random datapoints, and look up how to calculate the statistical significance for a sample that size.
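A sketch of that pilot run, assuming your rows fit in a list; the margin-of-error formula is the standard normal approximation for a proportion, which is what the "check online" step would give you:

```python
import math
import random

def sample_rows(rows, k=10_000, seed=42):
    # random pilot sample; fixed seed so the run is repeatable
    if len(rows) <= k:
        return list(rows)
    return random.Random(seed).sample(rows, k)

def margin_of_error(p, n, z=1.96):
    # 95% margin for an observed proportion p in a sample of n
    # (normal approximation to the binomial)
    return z * math.sqrt(p * (1 - p) / n)

pilot = sample_rows(list(range(1_000_000)))
# say 12% of the 10,000 sampled rows matched:
moe = margin_of_error(0.12, len(pilot))
print(f"true match rate ≈ 12% ± {moe * 100:.2f} percentage points")
```

At n = 10,000 the margin is well under a percentage point for mid-range rates, which is plenty to decide whether the matching logic is worth an exhaustive run.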

Then do an exhaustive run later in the background. This is how you turn a one- or two-day task into an hour or so (or at least into one that gives you feedback), and how you turn a cloud task into a Raspberry Pi task. This is why they pay the smart boys the big bucks.

Done editing u/JokerPede

111 days ago
1 score